47. Example: predicting CTR (search ads)
Rank = bid * CTR
Predict CTR for each ad to determine placement, based on:
- Historical CTR
- Keyword match
- Etc.
Approach: supervised learning
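A minimal sketch of what this supervised approach might look like; the features, training data, and choice of model are illustrative assumptions, not from the deck. The idea is simply: fit a classifier on historical click data, predict a click probability for each candidate ad, and order placements by bid * predicted CTR.

```python
# Hypothetical CTR-prediction sketch: train on historical clicks, then
# rank candidate ads by Rank = bid * predicted CTR (per the slide).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [historical CTR of the ad, keyword-match score]; label: clicked?
X_train = np.array([[0.05, 0.9], [0.01, 0.2], [0.08, 0.7], [0.02, 0.1]])
y_train = np.array([1, 0, 1, 0])

model = LogisticRegression()
model.fit(X_train, y_train)

# Candidate ads for one query: (bid in $, features)
candidates = [(0.50, [0.04, 0.8]), (1.20, [0.01, 0.3]), (0.75, [0.06, 0.6])]

bids = np.array([bid for bid, _ in candidates])
feats = np.array([f for _, f in candidates])
p_ctr = model.predict_proba(feats)[:, 1]   # predicted click probability

rank_scores = bids * p_ctr                 # Rank = bid * CTR
order = np.argsort(-rank_scores)           # best placement first
for i in order:
    print(f"ad {i}: bid={bids[i]:.2f}, pCTR={p_ctr[i]:.3f}, score={rank_scores[i]:.4f}")
```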
49. MapReduce
• MapReduce is a programming model for distributed computing
• It works like a Unix pipeline:
– cat input | grep | sort | uniq -c > output
– Input | Map | Shuffle & Sort | Reduce | Output
• Strengths:
– Easy to use! The developer just writes a couple of functions
– Moves compute to data
• Schedules work on an HDFS node holding the data, if possible
– Scans through data, reducing seeks
– Automatic reliability and re-execution on failure
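To make the pipeline analogy concrete, here is a word-count sketch in the style of Hadoop Streaming (the deck names no API, and the launch command and paths below are assumptions): the mapper plays the role of grep, the framework's shuffle plays sort, and the reducer plays uniq -c.

```python
# wordcount.py -- minimal Hadoop Streaming sketch mirroring the slide's
# Unix-pipeline analogy. Illustrative only; jar location and paths vary:
#   hadoop jar hadoop-streaming.jar \
#     -mapper "python wordcount.py map" \
#     -reducer "python wordcount.py reduce" \
#     -input in/ -output out/
import sys

def mapper():
    # Map: emit "word<TAB>1" for every word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce: input arrives sorted by key (the Shuffle & Sort step),
    # so identical words form a contiguous run we can count.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```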
I want to thank Chris for inviting me here today. Chris and team have done a number of projects with Hadoop. They are a great resource for Big Data projects. Chris is an Apache Board member and was a contributor to Hadoop even before we spun it out of the Nutch project.
As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop, and HDP in particular, being introduced as a complement to the traditional approaches. It is not replacing the database but rather complementing it, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with:
- Existing applications, such as Tableau, SAS, Business Objects, etc.
- Existing databases and data warehouses, for loading data to/from the data warehouse
- Development tools used for building custom applications
- Operational tools for managing and monitoring
Hadoop started as a way to enhance Search. Science clusters launched in 2006 as an early proof of concept. Science results drive new applications -> becomes core Hadoop business.
At Hortonworks today, our focus is very clear: we develop, distribute and support a 100% open source distribution of Enterprise Apache Hadoop. We employ the core architects, builders and operators of Apache Hadoop and drive the innovation in the open source community. We distribute the only 100% open source Enterprise Hadoop distribution: the Hortonworks Data Platform. Given our operational expertise running some of the largest Hadoop infrastructure in the world at Yahoo, our team is uniquely positioned to support you. Our approach is also uniquely endorsed by some of the biggest vendors in the IT market. Yahoo is both an investor and a customer, and most importantly, a development partner. We partner to develop Hadoop, and no distribution of HDP is released without first being tested on Yahoo's infrastructure, using the same regression suite they have used for years as they grew to have the largest production cluster in the world. Microsoft has partnered with Hortonworks to include HDP in both their off-premise offering on Azure and their on-premise offering, under the product name HDInsight. This also includes integration with Visual Studio for application development and with System Center for operational management of the infrastructure. Teradata includes HDP in their products in order to provide the broadest possible range of options for their customers.
Tell the inception story: the plan to differentiate Yahoo, recruit talent, and ensure that Y! was not built on a legacy private system. From YST.
Archival use case at a big bank: 10K files a day == 400GB. Need to store everything in EBCDIC format for compliance, and also convert it into Hadoop for analytics. Compute a checksum for every record and keep a tally of which primary keys changed each day. Also, bring together financial, customer, and weblog data for new insights. Share with Palantir, Aster Data, Vertica, Teradata, and more…

Step One: Create Tables or Partitions
In step one of the dataflow, the mainframe or another orchestration-and-control program notifies HCatalog of its intention to create a table, or to add a partition if the table already exists. This would use standard SQL data definition language (DDL) such as CREATE TABLE and DESCRIBE TABLE (see http://incubator.apache.org/hcatalog/docs/r0.4.0/cli.html#HCatalog+DDL). Multiple tables need to be created, though: some are job-specific temporary tables, while others need to be more permanent. Raw-format data can be stored in an HCat table, partitioned by some date field (month or year, for example). The staged record data will most certainly be stored in HCatalog partitioned by month (see http://incubator.apache.org/hcatalog/docs/r0.4.0/dynpartition.html). Then any missing month in the table can be easily detected and generated from the raw-format storage on the fly. In essence, HCatalog allows the creation of tables that up-level this architectural challenge from managing a bunch of manually created files and a loose naming convention to a strong yet abstract table structure, much like a mature database solution would have.

Step Two: Parallel Ingest
Before or after tables are defined in the system, we can start adding data in parallel using WebHDFS or DistCp. In the Teradata-Hortonworks Data Platform, these architectural components work seamlessly with the standard HDFS namenode to notify DFS clients of all the datanodes to which to write data. For example, a file made up of 10,000 64-megabyte blocks could be transferred to a 100-node HDFS cluster using all 100 nodes at once. By asking WebHDFS for the write locations for each block, a multi-threaded or chunking client application could write each 64MB block in parallel, 100 blocks or more at a time, effectively dividing the 10,000-block transfer into 100 waves of copying. 100 copy waves would complete 100 times faster than 10,000 one-by-one block copies. Parallel ingest with HCatalog, WebHDFS and/or DistCp will lead to massive speed gains. Critically, the system can copy chunked data directly into partitions of pre-defined tables in HCatalog. This means that each month, staged record data can join the staging tables without dropping previous months, and staged data can be partitioned by month while each month itself is loaded using as many parallel ingest servers as the solution architecture desires, balancing cost with performance.
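As a concrete illustration of Step Two, here is a minimal parallel-ingest sketch against the WebHDFS REST API. The namenode address, user, file list, and partition layout are assumptions; the two-step CREATE (namenode redirect, then datanode write) is WebHDFS's documented behavior.

```python
# Hedged sketch of parallel ingest over WebHDFS: many concurrent writers,
# each following the namenode's redirect to a datanode.
import concurrent.futures
import requests

NAMENODE = "http://namenode.example.com:50070"  # assumed address
USER = "ingest"                                  # assumed HDFS user

def put_file(local_path, hdfs_path):
    # Step 1: ask the namenode where to write; it answers with a 307
    # redirect naming a datanode to receive the bytes.
    url = f"{NAMENODE}/webhdfs/v1{hdfs_path}?op=CREATE&user.name={USER}&overwrite=true"
    r = requests.put(url, allow_redirects=False)
    r.raise_for_status()
    datanode_url = r.headers["Location"]
    # Step 2: stream the file directly to that datanode.
    with open(local_path, "rb") as f:
        requests.put(datanode_url, data=f).raise_for_status()

# Each monthly chunk lands in its own partition directory
# (layout assumed: .../staged_records/month=YYYY-MM/...).
chunks = [(f"/data/out/chunk{i:04d}.dat",
           f"/warehouse/staged_records/month=2012-11/chunk{i:04d}.dat")
          for i in range(100)]

# Many writers at once: the "waves of copying" described above.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(lambda p: put_file(*p), chunks))
```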
Step Three: Notify on Upload
Next, the parallel ingest system needs to notify the HCatalog engine that the files have been uploaded and, simultaneously, any end-user transformation or analytics workload waiting for the partition needs to be notified that the file is ready to support queries. By "ready" we mean that the partition is whole and completely copied into HDFS. HCatalog has built-in blocking and non-blocking notification APIs that use standard message buses to notify any interested parties that workload (be it MapReduce or HDFS copy work) is complete and valid (see: http://incubator.apache.org/hcatalog/docs/r0.4.0/notification.html). The way this system works is that any job created through HCatalog is acknowledged with an output location. The messaging system later replies that a job is complete and, since the eventual output location was returned when the job was submitted, the calling application can immediately go to the target output file and find the data it needs. In this next-gen ETL use case, we will use this notification system to immediately fire a Hive job to begin transformation whenever a partition is added to the raw or staged data tables. This makes it easier to build systems that depend on these transformations: they needn't poll for data, nor hard-code file locations for the sources and sinks of data moving through the dataflow.

Step Four: Fire Off UDFs
Since HCatalog can notify interested parties of the completion of file I/O tasks, and since HCatalog stores file data underneath abstracted table and partition names and locations, invoking the core UDFs that transform the mainframe's data into standard SQL data types can be programmatic. In other words, when a partition is created and the data backing it is fully loaded into HDFS, a persistent Hive client can wake up, be notified of the new data, and grab that data to load into Teradata.

Step Five: Invoke Parallel Transport (Q1 2013)
Coming in the first quarter of 2013 or soon thereafter, Teradata and the Hortonworks Data Platform will communicate using Teradata's parallel transport mechanism. This will provide the same performance benefits as parallel ingest, but for the final step in the dataflow. For now, systems integrators and/or Teradata and Hortonworks team members can implement a few DFS clients to load chunks or segments of the table data into Teradata in parallel.
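A hedged sketch of the notification-driven flow in Steps Three and Four, assuming a STOMP-accessible message bus (HCatalog's notifications ride JMS/ActiveMQ). The broker address, topic, message payload shape, table names, and the decode_ebcdic UDF are all illustrative assumptions, not HCatalog's actual wire format.

```python
# Consume a hypothetical "partition added" message, then fire a Hive job
# to transform the new partition. Uses the stomp.py 8.x client API.
import json
import subprocess
import stomp

class PartitionListener(stomp.ConnectionListener):
    def on_message(self, frame):
        event = json.loads(frame.body)            # assumed JSON payload
        table = event["table"]                    # e.g. "staged_records"
        key, val = event["partition"].split("=")  # e.g. "month=2012-11"
        # No polling and no hard-coded paths: HCatalog resolves the
        # partition's storage location behind the table abstraction.
        hql = (
            f"INSERT OVERWRITE TABLE {table}_clean PARTITION ({key}='{val}') "
            f"SELECT decode_ebcdic(record) FROM {table} WHERE {key}='{val}'"
        )
        subprocess.run(["hive", "-e", hql], check=True)

conn = stomp.Connection([("activemq.example.com", 61613)])  # assumed broker
conn.set_listener("hcat-listener", PartitionListener())
conn.connect(wait=True)
conn.subscribe(destination="/topic/hcat.events", id="1", ack="auto")
```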
Example: high-tech surveys, customer sat and product sat. Surveys have multiple-choice and freeform sections. Ingest and analyze the plain-text sections. Join cross-channel support requests and device telemetry back to the customer. Another example: a wireless carrier and the "golden path".
Example: retail custom homepage, built around clusters of related products. Set up models in HBase that determine when user behaviors trigger recommendations, or inform users of custom recommendations when they enter the site.
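One way the HBase model lookup might be wired up (a sketch only, using the happybase Thrift client; the table name, column family, and trigger rule are assumptions for illustration):

```python
# Hedged sketch: per-user model rows in HBase, keyed by user id,
# consulted when a behavior event arrives.
import happybase  # Thrift-based HBase client

conn = happybase.Connection("hbase.example.com")  # assumed host
models = conn.table("user_reco_models")           # assumed table

def on_user_event(user_id: str, event: str):
    row = models.row(user_id.encode())
    # Assumed schema: m:trigger_event names the behavior that fires a
    # recommendation; m:product_cluster names the related-product cluster.
    if row.get(b"m:trigger_event", b"").decode() == event:
        cluster = row.get(b"m:product_cluster", b"").decode()
        return f"Recommend products from cluster {cluster!r} to {user_id}"
    return None

print(on_user_event("u42", "viewed_cart"))
```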
Community-developed frameworks:
- Machine learning / analytics (MPI, GraphLab, Giraph, Hama, Spark, …)
- Services inside Hadoop (memcache, HBase, Storm, …)
- Low-latency computing (CEP or stream processing)
Hortonworks Sandbox
Hortonworks accelerates Hadoop skills development with an easy-to-use, flexible and extensible platform to learn, evaluate and use Apache Hadoop.
What it is: a virtualized, single-node implementation of the enterprise-ready Hortonworks Data Platform. It provides demos, videos and step-by-step hands-on tutorials, plus pre-built partner integrations and access to datasets.
What it does: dramatically accelerates the process of learning Apache Hadoop.
- See It: demos and videos to illustrate use cases
- Learn It: multi-level, step-by-step tutorials
- Do It: hands-on exercises for faster skills development
How it helps: accelerates and validates the use of Hadoop within your unique data architecture. Use your own data to explore and investigate your use cases. Zero to big data in 15 minutes.
But beyond Core Hadoop, Hortonworkers are also deeply involved in the ancillary projects that are necessary for more general usage. As you can see, in both code count and committers, we contribute more than anyone else to Core Hadoop, and we are doing the same for the other key projects such as Pig, Hive, HCatalog and Ambari. This community leadership across both core Hadoop and the related open source projects is crucial in enabling us to play the critical role in turning Hadoop into Enterprise Hadoop.
So how does this get brought together into our distribution? It is really pretty straightforward, but also very unique. We start with the group of open source projects that I described and that we are continually driving in the OSS community. [CLICK] We then package the appropriate versions of those open source projects, integrate and test them using a full suite, including all the IP for regression testing contributed by Yahoo, and [CLICK] contribute back all of the bug fixes to the open source tree. From there, we package and certify a distribution in the form of the Hortonworks Data Platform (HDP), which includes both Hadoop Core and the related projects required by the Enterprise user, and provide it to our customers. Through this application of enterprise software development process to the open source projects, the result is a 100% open source distribution that has been packaged, tested and certified by Hortonworks. It is also 100% in sync with the open source trees.