We are a managed service AND a solution provider of elite database and System Administration skills in Oracle, MySQL and SQL Server
Trying to separate data into “small” and “big” is not a useful segmentation. Let’s look at structure, processing and data sources instead.
I am going to show a lot of examples of how Hadoop is used to store and process data. I don’t want anyone to tell me “But I can do it in Oracle”. I know you can, and so can I. But there is no point in using Oracle where it is less efficient than other solutions or when it doesn’t have any specific advantage.
Big data is not called big data because it fits well on a thumb drive. It requires a lot of storage, partly because it’s a lot of data and partly because it is unstructured, unprocessed, un-aggregated, repetitive and generally messy.
The ideas are simple:
1. Data is big, code is small. It is far more efficient to move the small code to the big data than vice versa. If you have a pillow and a sofa in a room, you typically move the pillow to the sofa, and not vice versa. But many developers are too comfortable with the select-then-process anti-pattern. This principle is in place to help with the throughput challenges.
2. Sharing is nice, but safe sharing of data typically means locking, queueing, bottlenecks and race conditions. It is notoriously difficult to get concurrency right, and even if you do, it is slower than the alternative. Hadoop works around the whole thing. This principle is in place to deal with parallel processing challenges.
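Both principles can be illustrated with a minimal sketch in plain Python (not the Hadoop API). The block contents and the names `BLOCKS`, `map_block`, `merge` and `run_job` are hypothetical. Each map call sees only its own block and returns a small partial result, so no state is shared, and only the small results need to be combined.

```python
from collections import Counter
from functools import reduce

# Hypothetical stand-ins for data blocks stored on three different nodes.
BLOCKS = [
    ["error disk full", "info job started"],
    ["error disk full", "warn slow query"],
    ["info job finished"],
]

def map_block(lines):
    """Count log levels in one block; reads only that block, shares nothing."""
    counts = Counter()
    for line in lines:
        counts[line.split()[0]] += 1
    return counts

def merge(left, right):
    """Combine two small partial results; only these would cross the network."""
    return left + right

def run_job(blocks):
    # On a cluster, each map_block call would run on the node holding the block.
    partials = [map_block(block) for block in blocks]
    return reduce(merge, partials, Counter())

totals = run_job(BLOCKS)
```

Because `map_block` never touches anything outside its own block, the calls could run in any order or in parallel without locks.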
The default block size is 64 MB. You can place any data file in HDFS; later processing can find the meaning in the data.
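As a rough illustration of what the block size means for a file, here is a small arithmetic sketch. The 64 MB figure comes from the note above; the function name `blocks_needed` and the sample file sizes are assumptions made for the example.

```python
import math

BLOCK_SIZE_MB = 64  # default block size quoted in the note above

def blocks_needed(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file of the given size is split into."""
    return math.ceil(file_size_mb / block_size_mb)

one_gb = blocks_needed(1024)   # a 1 GB file
small = blocks_needed(10)      # a 10 MB file still occupies one block entry
```

With these numbers a 1 GB file is split into 16 blocks, each of which can be processed by a separate task.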
Many projects fail because people imagine a very rosy image of Hadoop. They think they can just throw all the data there and it will magically and quickly become value. Such misguided expectations also happen with other platforms and doom other projects too. To be successful with Hadoop, we need to be realistic about it.
Note that while this presentation shows use cases where Hadoop is used to enhance the enterprise data warehouse, there are many examples where Hadoop is the backend of an entire product. Search engines and recommendation engines are such examples. While those are important use cases, they are out of scope for this presentation.
Much more controversial, especially when going from Oracle to Oracle. The data is clearly structured, so why can’t we use an RDBMS (on either the OLTP or the DW side) to process it?
Hadoop makes sense:
* When structured data needs integration with unstructured data before loading.
* When the ETL part of the process doesn’t scale. If 24h of data processing takes more than 24h, the choice is either a bigger database or more Hadoop nodes. If the rest of the database workload scales, Hadoop is an attractive option.
* When Hadoop replaces a homegrown Perl file processing system. It is more centralized, easier to manage and scales better.
There is a lot of interesting data that is not generated by your company:
* Listings of businesses in specific locations.
* Connections in social media.
The data may be unstructured, semi-structured or even structured, but it isn’t structured in the way your DWH expects and needs.
We need a landing pad for cleanup, pre-processing, aggregating, filtering and structuring. Hadoop is perfect for this:
* Mappers can scrape data from websites efficiently.
* Map-reduce jobs clean up and process the data.
* The results are then loaded into your DWH.
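The cleanup, filtering and structuring step described above might look roughly like this in plain Python. The pipe-delimited input format, the sample listings and the column names are all invented for illustration; a real job would run the `clean` step as a mapper over many files.

```python
import csv
import io

# Hypothetical raw listings, as they might arrive from an external source.
RAW_LISTINGS = [
    "  Joe's Pizza | Toronto |  416-555-0101 ",
    "Joe's Pizza|Toronto|416-555-0101",      # duplicate once cleaned
    "Broken record without enough fields",
    "Book Nook | Ottawa | ",                  # missing phone number
    "Cafe Luna | Ottawa | 613-555-0199",
]

def clean(record):
    """Split one raw record into trimmed fields, or return None if unusable."""
    fields = [field.strip() for field in record.split("|")]
    if len(fields) != 3 or not all(fields):
        return None
    return tuple(fields)

def structure(records):
    """Clean, filter and de-duplicate records, keeping first-seen order."""
    seen = set()
    rows = []
    for record in records:
        row = clean(record)
        if row is not None and row not in seen:
            seen.add(row)
            rows.append(row)
    return rows

def to_csv(rows):
    """Render structured rows as CSV text that a DWH bulk loader could read."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "city", "phone"])
    writer.writerows(rows)
    return buffer.getvalue()

rows = structure(RAW_LISTINGS)
```

Only two of the five raw records survive: the duplicate, the malformed line and the listing without a phone number are dropped before anything reaches the warehouse.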
We want the top 3 items bought by left-handed women between the ages of 21 and 23 on November 15, 1998. How long will it take you to answer this question? For one of my customers, the answer is 25 minutes.
As data grows older, it usually becomes less valuable to the business, and it gets aggregated and shelved off to tapes or other cheap storage. This means that for many organizations, answering detailed questions about events that happened more than a few months ago is impossible or at least very challenging. The business learned to never ask those questions, because the answer is “you can’t”.
Hadoop combines cheap storage and massive processing power. This allows us to store a detailed history of our business and to generate reports about it. And once the answer to questions about history is “You will have your data in 25 minutes” instead of “impossible”, the questions turn out to be less rare than we assumed.
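The question above can be expressed as a filter-then-count job over detailed history. The sketch below is plain Python rather than a real Hadoop job, and the record layout (`item`, `gender`, `left_handed`, `age`, `day`) and sample purchases are assumptions. It also assumes that “between 21 and 23” includes both ends.

```python
from collections import Counter
from datetime import date

DAY = date(1998, 11, 15)

def purchase(item, gender="F", left_handed=True, age=22, day=DAY):
    """Build one hypothetical purchase record."""
    return {"item": item, "gender": gender, "left_handed": left_handed,
            "age": age, "day": day}

PURCHASES = [
    purchase("scissors"), purchase("scissors", age=21), purchase("scissors", age=23),
    purchase("mug"), purchase("mug"),
    purchase("notebook"),
    purchase("pen", left_handed=False),            # wrong handedness
    purchase("mug", age=30),                       # outside the age range
    purchase("scissors", day=date(1998, 11, 16)),  # wrong day
    purchase("pen", gender="M"),                   # wrong gender
]

def map_purchase(record):
    """Map step: emit (item, 1) only for records that match the question."""
    if (record["gender"] == "F" and record["left_handed"]
            and 21 <= record["age"] <= 23 and record["day"] == DAY):
        yield record["item"], 1

def top_items(purchases, n=3):
    """Reduce step: sum the emitted counts per item and keep the n largest."""
    counts = Counter()
    for record in purchases:
        for item, one in map_purchase(record):
            counts[item] += one
    return counts.most_common(n)

result = top_items(PURCHASES)
```

The point of the sketch is that the job needs the individual purchase records; once history has been aggregated away, no amount of processing power can answer the question.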
* 7 petabytes of log file data.
* 3 lines point to the security hole that allowed a break-in last week.
* Your DWH has aggregated information from the logs. Maybe.
* Hadoop is very cost effective at storing data: lots of cheap disks, and it is easy to throw data in without pre-processing.
* Search the data when you need it.
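Searching raw logs on demand is essentially a distributed grep: each mapper scans its own slice of the logs and emits only the matching lines. Here is a minimal single-machine sketch; the log lines and the search pattern are made up for illustration.

```python
import re

# Hypothetical raw web server log lines, stored without any pre-processing.
LOG = [
    "10:00:01 GET /index.html 200",
    "10:00:02 GET /admin/../../etc/passwd 200",
    "10:00:03 POST /login 302",
]

def scan(lines, pattern):
    """Map step: yield (line_number, line) for every line matching the pattern."""
    regex = re.compile(pattern)
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield number, line

# Look for path traversal attempts; the pattern is an example, not a rule set.
hits = list(scan(LOG, r"\.\./"))
```

Because the raw lines were kept rather than aggregated, the search can use a pattern nobody thought of when the logs were written.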
Pythian is a remote DBA company. Many customers feel a bit anxious when they let people they haven’t even met into their most critical databases.
One of the ways Pythian deals with this problem is by continuously recording the screen of the VM that DBAs use to connect to customer environments. Our customers have access to those videos and can replay them to check what the DBAs were doing.
Our system also allows text search in the video. Perhaps you want to know if we ever issued “drop table” on the system before a critical table disappeared? Or perhaps you want to see how we handled ORA-4032 so you can learn how to do it yourself in the future?
OCR of screen video capture from the Pythian privileged access surveillance system:
* Flume streams raw frames from the video capture.
* A Map-Reduce job runs OCR on the frames and produces text.
* A Map-Reduce job identifies text changes from frame to frame and produces a text stream with the timestamp when the text was on the screen.
* Other Map-Reduce jobs mine the text (and keystrokes) for insights:
  * Credit card patterns
  * Sensitive commands (like DROP TABLE)
  * Root access
  * Unusual activity patterns
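The last two stages of the pipeline above (detecting text changes between frames, then mining the text) can be sketched as follows. This is not Pythian's implementation: the frame data, the function names and the regular expressions are all hypothetical and deliberately simplistic.

```python
import re

# Hypothetical OCR output: (seconds since session start, text on screen).
FRAMES = [
    (0.0, "SQL> select count(*) from orders;"),
    (0.5, "SQL> select count(*) from orders;"),
    (1.0, "SQL> drop table orders;"),
    (1.5, "SQL> drop table orders;"),
    (2.0, "$ sudo su -"),
]

# Illustrative patterns only; a real rule set would need far more care.
SENSITIVE = {
    "drop_table": re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
    "root_access": re.compile(r"\bsudo\b|\bsu\s+-"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def text_changes(frames):
    """Emit (timestamp, text) only when the text differs from the previous frame."""
    previous = None
    for timestamp, text in frames:
        if text != previous:
            yield timestamp, text
            previous = text

def flag(stream):
    """Yield (timestamp, rule_name) for each sensitive pattern found in the text."""
    for timestamp, text in stream:
        for name, pattern in SENSITIVE.items():
            if pattern.search(text):
                yield timestamp, name

changes = list(text_changes(FRAMES))
alerts = list(flag(changes))
```

Collapsing identical consecutive frames first keeps the mining jobs small: five frames become three text events, and each alert carries the timestamp at which the text first appeared on screen.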
Wake up everyone, this is the meat and potatoes of the presentation. How do we integrate the Hadoop potatoes with the DWH meat?
It is often said that the best way to succeed is to avoid failure for long enough. Here is some advice that will help your Hadoop projects avoid failure.
* If the data is structured, especially if it arrives from a relational database, it is highly likely that a relational database will process it more efficiently than Hadoop. After all, RDBMSs were built for this, with many features to support data processing tasks.
* OLTP workloads don’t work with Hadoop at all. Just don’t try.
* Anything real-time will not work well with Hadoop.
* Most BI tools don’t integrate with Hadoop at the moment.
Taking an ETL process that used to be in an RDBMS and dropping it on Hadoop by exporting whole tables with Sqoop and using Hive to process the data is unlikely to be any faster. Getting value out of Hadoop involves evaluating the work, understanding the bottlenecks and finding the most efficient solution, whether with Hadoop, a relational database or both.
* Bad schema design is not big data.
* Using 8-year-old hardware is not big data.
* Not having a purging policy is not big data.
* Not configuring your database and operating system correctly is not big data.
* Poor data filtering is not big data either.
Keep the data you need and use, in a way that you can actually use it.
If doing this requires cutting-edge technology, excellent! But don’t tell me you need NoSQL because you don’t purge data and have un-optimized PL/SQL running on 10-year-old hardware.