"Introduction to the Hadoop Ecosystem" was presented to the Lansing Java User Group on 2/17/2015 by Vijay Mandava and Lan Jiang. The demo was built on top of HDP 2.2 and the AWS cloud.
2. Agenda
What is Big Data?
Relevance to your Enterprise
Hadoop Ecosystem
Use Cases & Java Developer fit
Demo
3. Big Data Definitions
• Wikipedia defines it as "data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage and process data
within a tolerable elapsed time"
• Gartner defines it as data with the following characteristics:
– High Velocity
– High Variety
– High Volume
• Another definition is "a large volume of unstructured data which
cannot be handled by traditional database management systems"
4. Why a game changer
• Schema on Read
– Interpret the data at processing time (see the sketch after this list)
– Keys and values are not intrinsic properties of the data but are
chosen by the person analyzing it
• Move code to data
– With traditional systems, we bring the data to the code and I/O
becomes a bottleneck
– With hand-rolled distributed systems, we would have to deal with our
own checkpointing/recovery; the framework handles this for us
• More data beats better algorithms
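To make "schema on read" concrete, here is a minimal Java sketch: the file is stored as raw text, and the field layout (a tab-separated timestamp, user id and URL, all hypothetical) is imposed only at processing time by whoever analyzes the data.

// Schema on read: the store holds raw lines; the "schema" is applied here,
// at read time, and a different analysis is free to interpret the same
// bytes differently. File name and field layout are hypothetical.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SchemaOnRead {
    public static void main(String[] args) throws IOException {
        Files.lines(Paths.get("clicks.log")).forEach(line -> {
            String[] fields = line.split("\t");   // interpretation happens here
            String userId = fields[1];            // "key" chosen by the analyst
            String url = fields[2];               // "value" chosen by the analyst
            System.out.println(userId + " -> " + url);
        });
    }
}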
5. Enterprise Relevance
• Missed opportunities
– Channels
– Data that is analyzed
• The constraint was high cost
– Storage
– Processing
• Future-proof your business
– Schema on Read
– Access patterns are not as relevant up front
– Not just future-proofing your architecture
6. Motivation and History
• Disk access speeds have not caught up with storage capacities
• Need a high-speed parallel processing platform to process large
datasets on a distributed file system
• Google published its MapReduce architecture in 2004
• MapReduce framework
– Split the query, distribute it and process it in parallel (Map step)
– Gather the results and deliver them (Reduce step)
• An Apache open source project called Hadoop implemented the
MapReduce framework
– "Software library that gives users the ability to process large datasets across clusters of
commodity hardware in a reliable, fault-tolerant manner using a simple programming model"
10. Hadoop 2 with YARN
Architecture diagram (source: Hadoop in Practice by Alex Holmes)
11. MapReduce
• Restrictive programming model
– Keys and values
– Map and reduce functions, with the only coordination between them
being the passing of keys and values (see the word-count sketch below)
• But still considered a general data-processing tool
– Google used it for its production search indexes
– Image analysis
– Machine learning algorithms
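The canonical word count illustrates the model; this is a minimal sketch against the Hadoop Java MapReduce API, not code from the talk. The only coordination between the two functions is the (word, count) pairs passed from map to reduce.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map step: split the input line and emit (word, 1) for each token
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                ctx.write(new Text(word), ONE);
            }
        }
    }
    // Reduce step: gather all counts for a word and sum them
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }
}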
12. Pig
• High-level scripting language
• Data flow language
– Good for describing data analysis problems as data flows
– Can plug in UDFs written in other languages such as Java, Scala and
JRuby (see the sketch after this list)
– Pig scripts can also be embedded in and executed from other languages
– Predominant use cases are
• Production ETL jobs
• Data exploration by analysts
• Higher-level abstraction over
– MapReduce
– Tez
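As an example of plugging in a UDF, here is the classic upper-case function from the Pig documentation, a minimal sketch in Java; the class and jar names are illustrative. A Pig script would REGISTER the jar and then call UPPER(name) directly.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Pig passes the function's arguments as a Tuple; null means no data
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}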
13. Hive
• Framework for a data warehouse on top of Hadoop
– SQL access to data on HDFS (see the JDBC sketch after this list)
– Queries for analysis
• Batch oriented; lower-latency engines include
– Impala
– Tez
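Hive's SQL access is exposed to Java programs through the HiveServer2 JDBC driver. This is a minimal sketch; the host name, default port 10000, credentials and table name are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // The query compiles down to a batch job on the cluster
             ResultSet rs = stmt.executeQuery(
                 "SELECT name, COUNT(*) FROM schools GROUP BY name")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}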
14. HBase
• NoSQL database on Hadoop
– Based on Google's BigTable
– Column-oriented database on HDFS
• Regular interactive/update use cases
– Real-time read/write random access (see the sketch after this list)
– Row updates are atomic
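A minimal sketch of real-time random reads and writes through the HBase Java client (the 1.x API; older 0.9x releases used HTable directly). The table name "clicks" and column family "d" are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("clicks"))) {
            // A Put touches a single row, and row updates are atomic
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("url"),
                          Bytes.toBytes("http://example.com"));
            table.put(put);

            // Random read of the same row
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("url"))));
        }
    }
}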
15. Sqoop
• Imports/exports data between RDBMSs and Hadoop
– HDFS, Hive, HBase
– Couchbase
– Uses the JDBC driver to determine the data types of the columns
– Handles serialization/deserialization
• The actual load is done internally by MapReduce jobs (see the sketch below)
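A minimal sketch of a Sqoop import driven from Java via Sqoop.runTool; the same arguments work on the sqoop command line. The JDBC URL, credentials, table and target directory are assumptions. Sqoop infers the column types through the JDBC driver and performs the copy as parallel MapReduce map tasks.

import org.apache.sqoop.Sqoop;

public class SqoopImport {
    public static void main(String[] args) {
        int exitCode = Sqoop.runTool(new String[] {
            "import",
            "--connect", "jdbc:mysql://db.example.com/sales",
            "--username", "etl",
            "--password-file", "/user/etl/.password",
            "--table", "orders",                 // source RDBMS table
            "--target-dir", "/data/orders",      // destination in HDFS
            "--num-mappers", "4"                 // parallel map tasks doing the load
        });
        System.exit(exitCode);
    }
}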
17. Real-time Streaming with Kafka & Storm
• Kafka
– Pub/sub messaging using topics
– Kafka producers publish to topics (see the producer sketch after this list)
• Storm
– Real-time computation engine
– Consumes data from spouts and passes it to bolts
– Can run on top of YARN
– Uses ZooKeeper; implemented in Clojure
– You define workflows (topologies) as directed acyclic graphs
– A true stream-processing engine, so used for low-latency ingestion
– Can support at-most-once, at-least-once and exactly-once semantics
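A minimal sketch of a Kafka producer publishing to a topic, using the Java producer API introduced in Kafka 0.8.2. The broker address and the "clicks" topic are assumptions; on the consuming side, a Storm topology would typically read the topic through a Kafka spout.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to the "clicks" topic, keyed by user id
            producer.send(new ProducerRecord<>("clicks", "user42", "/home"));
        }
    }
}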
18. Apache Spark
• High-speed, general-purpose engine for large-scale data
processing (see the sketch after this list)
• Does not need Hadoop; just needs a shared file system such as
S3, NFS or HDFS
• Spark can run on YARN
• Spark is implemented in Scala
• Has a streaming API, but it is really a batch-processing engine that
micro-batches
• Can support exactly-once semantics, but under some failure
conditions degrades to at-least-once
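A minimal sketch of Spark's general-purpose API from Java: load a file from a shared file system, filter it in parallel across the cluster, and count the matches. The HDFS path is an assumption; a local path or an s3:// URI works the same way.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ErrorCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ErrorCount");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // count() is an action, so it triggers the distributed job
        long errors = sc.textFile("hdfs:///logs/app.log")
                        .filter(line -> line.contains("ERROR"))
                        .count();
        System.out.println("errors: " + errors);
        sc.stop();
    }
}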
19. Common Use Cases
• Queries over detail-record data
• Queries over longer time spans of data
• Diagnostic/metrics/web log data analysis
• 360-degree view incorporating clickstream data
• Reports that cannot be generated within the needed timeframe
• Capture and analyze sensor data
• Analyze large volumes of image data
• Build user profiles from large volumes of data
• Sentiment analysis
• Recommendation engines
• Risk analysis
21. Closing
• The technology is in a hyper-growth phase
• Complex
• Tools/productivity/monitoring products are still evolving
• Start with a pilot project
• Treat adoption as an incremental journey
22. Demo - Start HDP Cluster in AWS
• 6 EC2 instances in total, type t2.medium
• RHEL 6.5, 3.75 GB memory, 10 GB disk each
• 1 Ambari server + a 5-node cluster
• 1 NameNode + 1 Secondary NameNode + 3 DataNodes
• Public dataset from
https://data.cityofchicago.org
23. Managing a Hadoop Cluster Using Ambari
• "Ambari" comes from an Indian word for a seat
carried on the back of an elephant
• Ambari is an Apache open source project that
is used to
• Provision Hadoop clusters
• Manage Hadoop clusters
• Monitor Hadoop clusters
• Agent-based deployment model
25. Demo - Hue
• Hue provides a web interface for
analyzing data in Hadoop
• Use HCatalog to create tables
• Demo Hive script
• Demo Pig script
26. Demo - Advanced Hive
• Use built-in UDFs to extract latitude and longitude info
• Use a custom UDF (Scala) to calculate the distance
between two locations (see the sketch after this list)
• Join the library and school tables and find
the libraries within 1 mile of each school
• Use Tableau to connect to Hive through the ODBC driver
to plot socioeconomic data
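The demo's distance UDF was written in Scala; here is an equivalent minimal sketch in Java using the haversine formula, with hypothetical table and column names. After ADD JAR and CREATE TEMPORARY FUNCTION haversine AS 'HaversineDistance', the join could filter on haversine(s.lat, s.lon, l.lat, l.lon) <= 1.0.

import org.apache.hadoop.hive.ql.exec.UDF;

public class HaversineDistance extends UDF {
    private static final double EARTH_RADIUS_MILES = 3958.8;

    // Great-circle distance in miles between two (lat, lon) points
    public Double evaluate(Double lat1, Double lon1, Double lat2, Double lon2) {
        if (lat1 == null || lon1 == null || lat2 == null || lon2 == null) {
            return null;
        }
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }
}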