Apache Hadoop 3 is coming! As the next major milestone for Hadoop and big data, it is attracting attention because it showcases several bleeding-edge technologies and significant features across all components of Apache Hadoop: Erasure Coding in HDFS, Docker container support, Apache Slider integration and native service support, Application Timeline Service version 2, Hadoop library updates, client-side classpath isolation, and more. In this talk, we will first give an update on the status of the Hadoop 3.0 release work in the Apache community and the feasible path through alpha and beta towards GA. Then we will dive deep into each new feature, including its development progress and maturity status in Hadoop 3. Last but not least, as a new major release, Hadoop 3.0 will contain some incompatible API or CLI changes that could make upgrades challenging for downstream projects and existing Hadoop users. We will go through these major changes and explore their impact on other projects and users.
Speaker: Sanjay Radia, Founder and Chief Architect, Hortonworks
Data Trends
From Characteristics of the Data to Data Consumption & Interaction
According to IBM, every day we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years.
Insight from Data is a key competitive differentiator
Open Source is evolving and adapting with these trends the fastest
Adopting Hadoop is not a destination but a journey
First, it enables online EC, which bypasses the conversion phase and immediately saves storage space; this is especially desirable in clusters with high-end networking. Second, it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle multiple files into a single coding group.
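To make the storage saving concrete, here is a minimal sketch that compares the raw storage needed by replication with that of an erasure-coded layout. The Reed-Solomon parameters (6 data units, 3 parity units) and the function names are assumptions chosen for illustration and do not come from the slides. The model also ignores partially filled stripes, so it is an idealized ratio rather than an exact accounting.

```python
def replication_overhead(replicas: int = 3) -> float:
    """Raw bytes stored per logical byte under N-way replication."""
    return float(replicas)


def ec_overhead(data_units: int = 6, parity_units: int = 3) -> float:
    """Raw bytes stored per logical byte under erasure coding.

    Idealized: assumes every stripe is full, so each group of
    `data_units` data cells carries `parity_units` parity cells.
    """
    return (data_units + parity_units) / data_units


def raw_bytes(logical_bytes: int, overhead: float) -> int:
    """Raw storage consumed for a given logical size and overhead factor."""
    return int(logical_bytes * overhead)


if __name__ == "__main__":
    one_tib = 1024 ** 4
    print("replication:", raw_bytes(one_tib, replication_overhead()) / one_tib, "TiB raw per TiB")
    print("erasure coding:", raw_bytes(one_tib, ec_overhead()) / one_tib, "TiB raw per TiB")
```

Under these assumed parameters the erasure-coded layout stores 1.5 raw bytes per logical byte, against 3 for three-way replication.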
Previously, scheduling order was based on unconsumed capacity
If the 70% queue has a lot of unconsumed capacity, it is scheduled first
Now you can say that the 30% queue is higher priority
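The difference between the two orderings can be sketched with a toy model. This is not the YARN scheduler's code: the `Queue` class, its fields, the queue names, and the usage figures are all hypothetical, and the sketch only shows how an explicit priority can override an ordering based on unconsumed capacity.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    guaranteed: float  # share of the cluster guaranteed to this queue
    used: float        # share of the cluster the queue is using now
    priority: int = 0  # higher value means more important

    @property
    def unconsumed(self) -> float:
        """Guaranteed capacity that the queue is not currently using."""
        return max(self.guaranteed - self.used, 0.0)


def order_by_unconsumed(queues):
    """Old behaviour in this toy model: most unconsumed capacity first."""
    return sorted(queues, key=lambda q: q.unconsumed, reverse=True)


def order_by_priority(queues):
    """New behaviour in this toy model: priority first, unconsumed capacity as tie-breaker."""
    return sorted(queues, key=lambda q: (q.priority, q.unconsumed), reverse=True)


# Hypothetical 70% / 30% split, with the smaller queue marked as higher priority.
queues = [
    Queue("batch", guaranteed=0.70, used=0.20),
    Queue("interactive", guaranteed=0.30, used=0.25, priority=1),
]

if __name__ == "__main__":
    print([q.name for q in order_by_unconsumed(queues)])
    print([q.name for q in order_by_priority(queues)])
```

With these made-up numbers, the 70% queue is served first under the capacity-based ordering, while the 30% queue moves to the front once it is given a higher priority.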
The original YARN design was not just for batch jobs.
- we started with batch, but the design was general
Graceful degradation
- remove nodes gracefully
- for cloud especially if you are using spot pricing
App centric – top two left pictures
Node centric
Resource centric – load vs capacity – overall and by queues
Cluster centric – nodes summary, heatmap of resource usage across nodes