Big data rmoug

We are a managed service and a solution provider with elite
database and system administration skills in Oracle, MySQL and
SQL Server.

3

Big Data is a marketing term, like cloud. All kinds of databases get called “Big
Data”. But in order for “Big Data” to define a set of solution architectures, we
need to define the problem we are solving.
The first requirement of a Big Data data store is volume: it must be a good
fit for storing large volumes of data. “Large volume” can mean different
things to different people, but if you have less than 5 TB of data, you’ll
need to work hard to convince me that you need a Big Data solution.
The second requirement is variety – the need to store not just short
strings and numbers, but also long texts from emails, web log files, XML,
images and video. It also refers to requirements around frequent changes in
the data types stored. Some databases deal with schema changes better
than others.
Velocity – the data store must store or serve data very quickly, even under
highly concurrent load. It should minimize overhead and locking.
Value – the data requires a large amount of processing in order to extract
business value, and the data store should support this.
Visualization – when the amounts of data are huge, new techniques for
extracting value are required, and data visualization is gaining prominence
as a method of data exploration. Big Data solutions should be well integrated
with visualization solutions.

4

One of the main reasons for the explosion of data stored in the last few
years is that many problems are easier to solve if you apply more data
to them.

Take the Netflix Challenge for example. Netflix challenged the AI
community to improve the movie recommendations made by Netflix to
its customers based on a database of ratings and viewing history.
Teams that used the available data more extensively did better than
teams that used more advanced algorithms on a smaller data set.

More data also allows businesses to make better, more informed
decisions. Why hold focus groups to decide on a new store design if
you can redesign several stores and compare how customers
proceeded through each store and how many left without buying?
Online stores make the process even easier.

Modern businesses are becoming more scientific and metrics driven, and
rely less on “gut feeling”, as the cost of running business experiments
and measuring the results decreases.

6

Data also arrives in more forms and from more sources than
ever. Some of these don’t fit into a relational database very
well, and for some, the relational database does not have
the right tools to process the data.
One of Pythian’s customers analyzes social media sources and
allows companies to find comments about their performance and
service and respond to complaints via non-traditional
customer support routes.
Storing Facebook comments and blog posts in Oracle for later
processing results in most of the data getting stored in
BLOBs, where it is relatively difficult to manage. Most of the
processing is done outside of Oracle using Natural Language
Processing tools. So, why use Oracle for storage at all? Why
not store and process the documents elsewhere and only
store the ready-to-display results in Oracle?

7

Companies like Infochimps sell organized public information
that can be used with the data collected by the business itself.
This is mostly geographically based information such as
houses for sale, local businesses, community surveys and
even petroleum reports. Such information can be valuable
for marketing departments. The information is not only
for sale, it is accessible through a programmable API, so new
data can arrive at your data center on the fly, on a regular
basis.
In general, the trend is that businesses use more and more
data that did not originate within the company – whether
tweets or purchased data. This means that the business has
little control over the format of the data as it arrives, and
the format can change overnight.

8

Data, especially from outside sources, is not in perfect condition to be
useful to your business.
Not only does it need to be processed into useful formats, it also needs:
• Filtering for potentially useful information. 99% of everything is crap
• Statistical analysis – is this data significant?
• Integration with existing data
• Entity resolution. Is “Oracle Corp” the same as “Oracle” and “Oracle
Corporation”?
• De-Duplication

Good processing and filtering of data can reduce the volume and variety
of data. It is important to distinguish between true and accidental
variety.

This requires massive use of processing power. In a way, there is a trade-
off between storage space and CPU. If you don’t invest CPU in filtering,
de-duping and entity resolution – you’ll need more storage.
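
As a rough illustration of the entity resolution and de-duplication steps above, here is a minimal Python sketch. The suffix list, the record layout and the function names are assumptions made up for this example; real entity resolution needs far more than suffix stripping.

```python
import re

# Hypothetical list of legal suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"corp", "corporation", "inc", "incorporated", "ltd", "llc", "co"}


def canonical_name(raw):
    """Lower-case the name, strip punctuation and drop trailing legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", raw.lower()).split()
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


def deduplicate(records):
    """Keep the first record seen for each (resolved company, comment) pair."""
    seen = {}
    for record in records:
        key = (canonical_name(record["company"]), record["comment"])
        seen.setdefault(key, record)
    return list(seen.values())


# Made-up sample records.
records = [
    {"company": "Oracle Corp", "comment": "slow support"},
    {"company": "Oracle", "comment": "slow support"},
    {"company": "Oracle Corporation", "comment": "great release"},
    {"company": "MySQL AB", "comment": "easy setup"},
]
unique = deduplicate(records)
```

Here the first two records collapse into one because both names resolve to the same entity and carry the same comment, so CPU spent on resolution directly reduces what has to be stored.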

9

• Bad schema design is not big data
• Using 8-year-old hardware is not big data
• Not having a purging policy is not big data
• Not configuring your database and operating system
correctly is not big data
• Poor data filtering is not big data either

Keep the data you need and use, in a way that you can
actually use it.
If doing this requires cutting edge technology, excellent! But
don’t tell me you need NoSQL because you don’t purge data
and have un-optimized PL/SQL running on 10-year-old hardware.

10

The new volume of data, and the need to transform it, filter it
and clean it up, require:
1. Not only more storage, but also faster access rates
2. Reliable storage. We want high availability and resilient
systems
3. Access to as many cores as you can get, to process all
this data
4. Cores that are as close to the data as possible, to
avoid moving large amounts of data over the network
5. An architecture that allows many of the cores to be used
in parallel for data processing

11

Data warehouses require the data to be structured in a certain way, and
it has to be structured that way before the data gets into the data
warehouse. This means that we need to know all the questions we
would like to answer with this data when designing the schema for the
data warehouse.

This works very well in many cases, but sometimes there are issues:
• The raw data is not relational (images, video, text) and we want to
keep the raw data for future use
• The requirements from the business frequently change

In these cases it is better to store the data and create patterns from it as
it is parsed and processed. This allows the business to move from large
up-front design to just-in-time processing.
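
A small Python sketch of this "store first, structure later" idea, under the assumption that raw events are kept as JSON lines. The sample lines, field names and function names are invented for illustration. Each question applies its own structure when the data is read, so a new question does not require a schema change.

```python
import json

# Raw events are stored exactly as they arrived; these lines are made-up samples.
RAW_LINES = [
    '{"user": "ann", "type": "tweet", "text": "love the new release"}',
    '{"user": "bob", "type": "photo", "url": "http://example.com/1.jpg", "tags": ["sky"]}',
    '{"user": "ann", "type": "tweet", "text": "support was slow", "lang": "en"}',
    'not valid json at all',
]


def read_events(lines):
    """Parse at read time and skip lines that cannot be parsed."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def tweets_per_user(lines):
    """First question: how many tweets did each user write?"""
    counts = {}
    for event in read_events(lines):
        if event.get("type") == "tweet":
            counts[event["user"]] = counts.get(event["user"], 0) + 1
    return counts


def distinct_tags(lines):
    """A later question, answered from the same raw data without reloading it."""
    tags = set()
    for event in read_events(lines):
        tags.update(event.get("tags", []))
    return tags
```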

For example: the Astrometry project searches Flickr for photos of the night
sky, identifies the part of the sky each photo is from and the prominent
celestial bodies in it, and creates a standard database of the position of
elements in the sky.

12

Hadoop is the most common solution for the new Big Data
requirements. It’s a scalable distributed file system, and a
distributed job processing system on top of the file system.
This lets companies keep massive amounts of unstructured
data and efficiently process it. The assumption behind
Hadoop is that most jobs will want to scan entire data sets,
not specific rows or columns. So efficient access to specific
data is not a core capability.

Hadoop is open source, and there is a large ecosystem of
tools, products and appliances built around it. This includes
open source tools that make data processing on Hadoop easier
and more accessible, BI and integration products, improved
implementations of Hadoop that are faster or more reliable,
Hadoop cloud services and hardware appliances.

13

Divide the job into many small tasks, each operating on a
separate set of data.
Run each task on the machine that holds its data. If one machine
is busy, we can find another with the same data. Machines and
tasks are constantly monitored.
Move programs around, not data.
If things are still too slow, add more servers (more disks
allow more data replication) and more cores.
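
The divide-and-merge pattern described above can be sketched in plain Python. This is not the Hadoop API: threads stand in for machines, the partitions are made-up sample data, and the function names are assumptions for illustration only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Each partition stands in for a block of data held on a different machine.
PARTITIONS = [
    ["error disk full", "login ok"],
    ["login ok", "error timeout"],
    ["login ok"],
]


def count_words(partition):
    """One task: reads a single partition and returns a partial result."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def run_job(partitions, workers=3):
    """Run one task per partition in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


totals = run_job(PARTITIONS)
```

Only the small partial counts travel to the merge step. The raw lines never leave their partition, which is the point of moving programs rather than data.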

15

Modern data centers generate huge amounts of logs from
applications and web services.
These logs contain very specific information about how users
are using our application and how the application performs.
Hadoop is often used to answer questions like:
• How many users use each feature on my site?
• Which page do users usually go to after visiting page X?
• Do people return more often to my site after I made the
new changes?
• What use patterns correlate with people who eventually
buy a product?
• What is the correlation between slow performance and
purchase rates?

Note that the web logs can be processed, loaded into an RDBMS
and parsed there. However, we are talking about very large
amounts of data, and each piece of data needs to be read just
once to answer each question. There are very few relations
there. Why bother loading all this into an RDBMS?
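
To make one of these questions concrete, here is a minimal Python sketch of "which page do users usually go to after visiting page X?". The log layout (user, timestamp, page) and the sample rows are assumptions. Note that each log record is read once, with no joins involved.

```python
from collections import Counter, defaultdict

# Made-up sample log: (user, timestamp, page).
LOG = [
    ("u1", 1, "/home"), ("u1", 2, "/search"), ("u1", 3, "/product"),
    ("u2", 1, "/home"), ("u2", 2, "/search"), ("u2", 5, "/cart"),
    ("u3", 4, "/home"), ("u3", 6, "/product"),
]


def next_page_counts(log, page):
    """Count which pages users visit immediately after `page`."""
    by_user = defaultdict(list)
    for user, timestamp, visited in log:
        by_user[user].append((timestamp, visited))

    counts = Counter()
    for visits in by_user.values():
        visits.sort()  # order each user's visits by timestamp
        for (_, current), (_, following) in zip(visits, visits[1:]):
            if current == page:
                counts[following] += 1
    return counts


after_home = next_page_counts(LOG, "/home")
```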

16

Hadoop has large storage, high bandwidth, lots of cores and
was built for data aggregation.
Also, it is cheap.
Data is dumped from the OLTP database (Oracle or MySQL) to
Hadoop. Transformation code is written on Hadoop to
aggregate the data (this is the tricky part) and the data is
loaded into the data warehouse (usually Oracle).
This is such a common use case that Oracle built an appliance
especially for this.
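
The dump, aggregate and load flow can be sketched at toy scale in Python. Everything here is a stand-in: the order rows are made up, an in-memory SQLite database plays the role of the data warehouse, and the table and function names are assumptions for this example.

```python
import sqlite3
from collections import defaultdict

# Made-up order rows, as they might be dumped from the OLTP database.
ORDERS = [
    ("2012-02-14", "book", 12.50),
    ("2012-02-14", "book", 7.50),
    ("2012-02-14", "lamp", 30.00),
    ("2012-02-15", "book", 10.00),
]


def aggregate(orders):
    """Transform step: one summary row per (day, product)."""
    totals = defaultdict(lambda: [0, 0.0])
    for day, product, amount in orders:
        totals[(day, product)][0] += 1
        totals[(day, product)][1] += amount
    return [
        (day, product, count, revenue)
        for (day, product), (count, revenue) in sorted(totals.items())
    ]


def load(rows):
    """Load step: SQLite stands in for the data warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_sales (day TEXT, product TEXT, orders INTEGER, revenue REAL)"
    )
    conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


summary = aggregate(ORDERS)
warehouse = load(summary)
```

The warehouse receives three summary rows instead of four raw ones. At real volumes the reduction is what makes the load into the warehouse affordable.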

17

A lot of the modern web experience revolves around websites being
able to predict what you’ll do next or what you’d like to do but don’t
know about yet.
• People you may know
• Jobs you may be interested in
• Other customers who looked at this product eventually bought…
• These emails are more important than others

To generate this information, usage patterns are extracted from OLTP
databases and logs, the data is analyzed, and the results are loaded to
an OLTP database again for use by the customer.

The analysis task started out as a daily batch job, but soon users expected
more immediate feedback.
More processing resources were brought in to speed up the process.
Then the system started incorporating customer feedback into the
analysis when making new recommendations. This new information
needed more storage and more processing power.

18

The best use cases for Hadoop are either storing large amounts
of unprocessed data, or off-loading computationally intensive
tasks away from expensive Oracle cores.

19

Businesses want to be able to respond to events automatically
and immediately.
This usually means comparing current information to historical
data and responding to trends and outliers immediately.
This means speeding up the rates at which data arrives, is
stored and processed and at which the results are served.
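
One simple way to compare a current value to historical data is a standard-deviation check, sketched below in Python. The threshold of 3 and the sample response times are assumptions for illustration. Production systems usually need something more robust than this.

```python
from statistics import mean, stdev


def is_outlier(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` standard deviations
    away from the mean of the historical values."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold


# Made-up historical response times in milliseconds.
response_ms = [120, 125, 118, 130, 122, 127, 121]
```

For example, `is_outlier(response_ms, 124)` is false while `is_outlier(response_ms, 400)` is true, so only the second value would trigger an automatic response.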

21

A recommendation system is an excellent example of how big
data brings value to a business.

You get customers to buy more by processing more data with
smarter analysis.
Recommendation systems are also iterative feedback systems.

The same idea can work within the organization – the
recommendations can be on business decisions to
executives, not necessarily for external customers.

Different tools can be used – analysis of relationship graphs,
correlations between past purchases, and clustering of
products and customers into groups with similar attributes.
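
As an illustration of the "correlations between past purchases" approach, here is a minimal co-purchase counter in Python. The baskets and function names are made up, and raw co-occurrence counts are only the simplest possible scoring.

```python
from collections import Counter
from itertools import combinations

# Made-up purchase history: one set of products per past order.
BASKETS = [
    {"tent", "sleeping bag", "stove"},
    {"tent", "sleeping bag"},
    {"tent", "lantern"},
    {"stove", "lantern"},
]


def co_purchase_counts(baskets):
    """Count how often each ordered pair of products was bought together."""
    pairs = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            pairs[(a, b)] += 1
            pairs[(b, a)] += 1
    return pairs


def recommend(product, baskets, top_n=2):
    """Return the products most often bought together with `product`."""
    pairs = co_purchase_counts(baskets)
    scores = Counter(
        {other: n for (item, other), n in pairs.items() if item == product}
    )
    return [other for other, _ in scores.most_common(top_n)]
```

Every new purchase adds a basket and changes the counts, which is the feedback loop: the recommendations shown influence what is bought next, and that feeds the next round of analysis.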

26

Example stolen from Greg Rahn to show why a chart is a
powerful data exploration tool for big data.

31

Oracle’s Big Data machine was built to move data between
the Oracle RDBMS and Hadoop fast, and I doubt anyone can
beat Oracle at that.
Both the tools that are bundled with the machine and the fast
IB connection to Exadata make it very attractive for
businesses wishing to use Hadoop as an ETL solution. Note that
the tools should also be avba

38
