12. Today’s Agenda
Wednesday, June 26
8:30 – 10:50 Keynote – Plenary Session
10:50 – 11:20 Break in Community Showcase
11:20 – 12:00 Breakout Sessions
12:50 – 2:05 Lunch
2:05 – 5:35 Breakout Sessions with Breaks
5:35 – 7:00 Exhibitor Reception
7:00 – 10:30 Hadoop Summit Party at The Tech Museum of Innovation
Editor's Notes
1.0 was architected for the large web properties; Hadoop 2.0 represents the next generation of the foundation of big data. Under development for nearly three years now, it is a more mature version of Hadoop, architected for broader use by mainstream enterprises. The main focus for this next generation has been the broader enterprise, whose requirements are a little different from those of the typical web properties that first adopted Hadoop. Some of those requirements forced the community to rethink its approach, and our experience running Hadoop at Yahoo provided much insight into how we could architect things to make them better. Some of the critical features are listed here; go through them. Highlight workloads and explain how 2.0 is engineered to meet these exacting demands. There is a graphic to help illustrate. We have moved beyond just batch…
Once data is stored in Hadoop, you have two options for running analytics: run batch processes (a MapReduce job or Pig) or do interactive querying using Hive. To respond to this, Hortonworks launched the Stinger Initiative, which aims to make Hive 100x faster through a combination of:
- More intelligent query optimization in Hive
- A modern persistence format called ORCFile
- An execution engine called Tez, which enables true interactive data processing on Hadoop
What causes latency in Hive?
- Sub-optimal queries on some join types
- Checkpointing to HDFS even when not needed
- Stored data not optimized for read
- Non-optimized operations for aggregations, projections, etc.
- High job startup time
Hive 0.11 includes Tez integration.
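The "stored data not optimized for read" point is the one ORCFile targets: a columnar layout lets a query that projects one column skip the rest of every row. A toy Python sketch of that row-versus-column access pattern (this illustrates the idea only; it is not the ORC format itself, and the table is made up):

```python
# Toy illustration of why columnar layouts (like ORCFile) speed up reads:
# a query projecting one column touches far less data than in a row layout.

# Row-oriented storage: every query walks whole rows.
rows = [
    {"user": "a", "clicks": 3, "country": "BR"},
    {"user": "b", "clicks": 7, "country": "US"},
    {"user": "c", "clicks": 2, "country": "US"},
]
total_row_layout = sum(r["clicks"] for r in rows)

# Column-oriented storage: the same table held as one list per column.
# SUM(clicks) reads only the "clicks" list and never touches the others.
columns = {
    "user": ["a", "b", "c"],
    "clicks": [3, 7, 2],
    "country": ["BR", "US", "US"],
}
total_col_layout = sum(columns["clicks"])

assert total_row_layout == total_col_layout == 12
```

The real format adds compression, indexes, and predicate pushdown on top of this layout, but the read-path saving shown here is the core of the benefit.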
Everybody is adopting Hadoop as a data processing platform because it accepts any kind of data and can process at almost any scale. But as people adopt Hadoop and load all this data onto it, they start to find other challenges. For example, how do you ensure data is being processed reliably? How do you know you're not keeping data that is too old? If you process data globally, how do you deal with multi-datacenter replication?
The challenge is that the tools that exist for Hadoop, including Oozie, DistCp and others, operate at a very low level, so you need expert developers to build and test data processing solutions. This sort of custom development takes a lot of time and money, and it is error prone precisely because you work at such a low level. Still, everybody does it this way because there are no real alternatives. I see a lot of people who use custom scripts to delete files when they get too old; this approach has a lot of drawbacks. Hadoop traditionally doesn't provide native tools that solve problems like retention, anonymization, and reprocessing.
Falcon solves this by letting developers work at a much higher level of abstraction. It provides native APIs for data processing, retention, replication and more that abstract away low-level tools like schedulers and the mechanical details of replication. With Falcon, developers do more, do it more easily, and avoid common mistakes. Avoiding common mistakes is probably the most important thing: data management on Hadoop is not easy, and Falcon was developed by engineers who worked on large-scale data management at Yahoo, complete with all the battle scars that brings. Falcon has a lot of those practical lessons baked into its APIs, ready for developers to simply use.
Question: What data lifecycle management needs do you have in your environment?
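The kind of low-level retention script that Falcon's retention policies replace can be sketched in a few lines of Python. The function below is hypothetical and runs against a local directory for illustration; a real Hadoop job would operate on HDFS paths and would still lack the auditing and coordination Falcon provides:

```python
import os
import time

def purge_old_files(directory, max_age_days):
    """Delete files older than max_age_days; return the paths removed.

    Illustrative only: a hand-rolled retention script like this has no
    auditing, no retries, and no coordination with downstream consumers,
    which is exactly the gap a higher-level tool like Falcon closes.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(path)
    return removed
```

Every team that writes a variant of this script re-discovers the same failure modes, which is the argument the note above makes for a shared, declarative lifecycle API.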
A pretty common scenario we see is a primary cluster plus a DR cluster. The DR cluster tends to be smaller than the primary, so you don't want it doing data processing, and you need to store less data on it overall. In this case, Falcon manages the flow of taking staged data and cleansing, conforming, and presenting it on the primary cluster. For DR purposes you absolutely need the staged data replicated to the backup cluster. However, the backup cluster isn't powerful enough to do data processing within SLA windows and doesn't have enough storage for all the cleansed and conformed data, so we don't replicate those; we replicate only the staged data and the presented data. That way, if the primary goes down, clients switch to the failover cluster and continue as if nothing happened. Because the failover cluster holds the staged data, it can be re-imported and re-processed if the primary was lost. All of this can be done in one Falcon job; doing it by hand is extremely error prone.
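The replication rule in this scenario is simple to state: of the four pipeline stages, only the endpoints go to the DR cluster. A minimal sketch of that decision (the stage names come from the note above; the function is illustrative, not a Falcon API):

```python
# Pipeline stages in the DR scenario: staged -> cleansed -> conformed -> presented.
# Only the endpoints are copied to the smaller DR cluster; the intermediate
# stages can be recomputed from the staged data if the primary is lost.
REPLICATED_STAGES = {"staged", "presented"}

def should_replicate(stage):
    """Return True if data at this pipeline stage is copied to the DR cluster."""
    return stage in REPLICATED_STAGES
```

Encoding the rule once, rather than scattering it across hand-written DistCp jobs, is what makes the single-Falcon-job version of this setup far less error prone.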
Operators can firewall the cluster so end users have access only to the gateway node. Users see one cluster endpoint that aggregates capabilities for data access, metadata, and job control. This provides perimeter security to make Hadoop security setup easier, and it enables integration with enterprise and cloud identity management environments.
Verification: verify the identity token (SAML, propagation of identity).
Authentication: establish identity at the gateway, authenticating against LDAP + AD.
Solid-state storage and disk drive evolution: so far, LFF drives seem to be maintaining their economic advantage (4TB drives now, and 7TB next year!), while SSDs are becoming ubiquitous and will become part of the architecture.
In-memory databases: bring them on, let's port them to YARN! Hadoop complements these technologies and shines with huge data.
Atom and ARM processors, GPUs: this is great for Hadoop! But vendors are not yet designing the right machines (bandwidth to disk is the key bottleneck).
Software-defined networks: more network functionality for less!
So enterprise Hadoop lies at the heart of the next-generation data architecture. Let's outline what's required in and around Hadoop to make it easy for the enterprise to use and consume.
At the center, we start with Apache Hadoop for distributed file storage and processing (à la MapReduce). To enable Hadoop within mainstream enterprises, we need to address enterprise concerns such as high availability, disaster recovery, snapshots, and security. On top of this, we need data services that make it easy to move data in and out of the platform, process and transform the data into useful formats, and enable people and other systems to access the data easily. This is where components like Apache Hive, Pig, HBase, HCatalog, and other tools fit.
Making it easy for data workers is important, but it's also important to make the platform easier to operate. Components like Apache Ambari, which addresses provisioning, management, and monitoring of the cluster, matter here. All of that (core and platform services, data services, and operational services) comes together into a vision of "enterprise Hadoop".
Ensuring that the enterprise Hadoop platform can be deployed flexibly across operating systems and virtual environments like Linux, Windows, and VMware is important. Targeting cloud environments like Amazon Web Services, Microsoft Azure, Rackspace OpenCloud, and OpenStack is increasingly important, as is the ability to provide enterprise Hadoop pre-configured within a hardware appliance like Teradata's Big Analytics Appliance, which helps pull Hadoop into enterprises as well.