Visual Mapping of Clickstream Data:
Introduction and Demonstration
Cedric Carbone, Ciaran Dynes
Talend
© Talend 2014
Ciaran Dynes, VP Products
Cedric Carbone, CTO
Agenda
• Clickstream live demo
• Moving from hand-code to code generation
• Performance benchmark
• Optimization of code generation
Hortonworks Clickstream demo
http://hortonworks.com/hadoop-tutorial/how-to-visualize-website-clickstream-data/
Trying to get from this…
to this…
Big Data – “pure Hadoop”
Design visually in Map/Reduce and optimize before deploying on Hadoop
Demo overview
• Demo flow overview:
1. Load raw Omniture web log files to HDFS
• The ‘schema on read’ principle allows any data type to be loaded easily into a ‘data lake’, where it is then available for analytical processing
• http://ibmdatamag.com/2013/05/why-is-schema-on-read-so-useful/
2. Define a Map/Reduce process to transform the data
• Identical skills to any graphical ETL tool
• Lookup customer and product data to enrich the results
• Results written back to HDFS
3. Federate the results to a visualisation tool of your choice
• Excel
• Analytics tools such as Tableau, QlikView, etc.
• Google Charts
Big Data Clickstream Analysis
[Diagram: web logs are loaded to HDFS by Talend; Talend Big Data (Integration) processes them on Hadoop (HDFS, Map/Reduce, Hive); Talend federates the results to analytics, feeding the Clickstream Dashboard]
Native Map/Reduce Jobs
• Create classic ETL patterns using native Map/Reduce
- Only data management solution on the market to generate native Map/Reduce code
• No need for expensive big data coding skills
• Zero pre-installation on the Hadoop cluster
• Hadoop is the “engine” for data processing
#dataos
SHOW ME
PERFORMANCE OF CODE GENERATION
MapReduce 2.0, YARN, Storm, Spark
• YARN: Ensures predictable performance & QoS for all apps
• Enables apps to run “IN” Hadoop rather than “ON”
• In Labs: Streaming with Apache Storm
• In Labs: mini-batch and in-memory with Apache Spark
Applications Run Natively IN Hadoop
[Diagram: application types running on YARN (Cluster Resource Management) over HDFS2 (Redundant, Reliable Storage): BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, Spark), GRAPH (Giraph), NoSQL (MongoDB), EVENTS (Falcon), ONLINE (HBase), OTHER (Search). Source: Hortonworks]
Talend: Tap – Transform – Deliver
TAP (Ingestion): Sqoop, Flume, HDFS API, HBase API, Hive, 800+
TRANSFORM (Data Refinement): Profile, Parse, Map, CDC, Cleanse, Standardize, Machine Learning, Match
DELIVER (as an API): ActiveMQ, Karaf, Camel, CXF, Kafka, Storm, Meta, Security, MDM, iPaaS, Govern, HA
© Talend 2013
Talend Labs Benchmark Environment
• Context: 9-node cluster, replication factor 3
- Dell R210-II, 1 Xeon® E3 1230 v2, 4 cores, 16 GB RAM
- Map slots: 2 slots / node
- Reduce slots: 2 slots / node
• Total processing capabilities:
- 9 × 2 map slots: 18 maps
- 9 × 2 reduce slots: 18 reduces
• Data volume: 1, 10, and 100 GB
TPC-H Benchmark
• The Apache Pig and Hive communities are using TPC-H benchmarks
- https://issues.apache.org/jira/browse/PIG-2397
- https://issues.apache.org/jira/browse/HIVE-600
• We are currently running the same tests in our labs
- Pig hand-coded script vs. Talend Pig generated code
- Pig hand-coded script vs. Talend Map/Reduce generated code
- HiveQL produced by the community vs. Hive ELT capabilities
• Partial results already available for Pig
- Very good results
Optimizing Job configuration?
• By default, Talend follows Hadoop recommendations regarding the number of reducers usable for the job execution.
• The rule is that 99% of the total reducers available can be used
- http://wiki.apache.org/hadoop/HowManyMapsAndReduces
- For the Talend benchmark, the default max reducers is:
• 3 nodes: 5 (3 × 2 = 6 slots; 6 × 99% = 5.94, rounded down to 5)
• 6 nodes: 11 (6 × 2 = 12 slots; 12 × 99% = 11.88, rounded down to 11)
• 9 nodes: 17 (9 × 2 = 18 slots; 18 × 99% = 17.82, rounded down to 17)
- Another customer benchmark, default max reducers:
• 700 × 99% = 693 (assumption: half Dell and half HP servers)
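The arithmetic behind these defaults can be sketched in a few lines. This is only an illustration of the 99% rule as stated on the slide, not Talend's generated code; the function name `default_max_reducers` and its parameters are hypothetical.

```python
def default_max_reducers(nodes: int, reduce_slots_per_node: int, percent: int = 99) -> int:
    """Reducers usable by default: a percentage of all reduce slots, rounded down.

    Hypothetical helper that mirrors the slide's rule of thumb.
    """
    total_slots = nodes * reduce_slots_per_node
    # Integer arithmetic avoids floating-point rounding surprises.
    return (total_slots * percent) // 100


if __name__ == "__main__":
    # The benchmark cluster has 2 reduce slots per node.
    for nodes in (3, 6, 9):
        print(nodes, "nodes:", default_max_reducers(nodes, 2), "reducers")
```

Rounding down rather than to the nearest integer is what reproduces the slide's figures of 5, 11, and 17.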
TPC-H Results: Pig Hand-Coded vs. Pig Generated
• 19 tests with results similar to or better than Pig hand-coded scripts
• Code is already optimized and automatically applied
Talend code is faster
PERFORMANCE IMPROVEMENTS
• 3 tests will benefit from a new COGROUP feature (requires CoGroup)
Example: How Sort works for Hadoop
Talend has implemented the TeraSort algorithm for Hadoop
1. A first Map/Reduce job is generated to analyze the data ranges
- Each mapper reads its data and analyzes the critical values of its buckets
- The reducer produces quartile files for all the data to sort
2. A second Map/Reduce job is started
- Each mapper simply sends the sort key to the reducer
- A custom partitioner is created to send the data to the best bucket, depending on the quartile file previously created
- Each reducer outputs the data sorted by bucket
• Research: tSort: GraySort, MinuteSort
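The two passes described above can be sketched in plain Python, with no Hadoop involved. This is a single-process illustration of the range-partitioning idea, not Talend's implementation: the function names, the sampling step, and the default of four buckets (which makes the cut points the sample quartiles) are all assumptions.

```python
import bisect
import random


def find_cut_points(sample, num_buckets):
    """Pass 1: derive bucket boundaries from a sample of the keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    # One boundary between each pair of adjacent buckets.
    return [ordered[len(ordered) * i // num_buckets] for i in range(1, num_buckets)]


def partition(key, cut_points):
    """Custom partitioner: index of the bucket (reducer) whose range holds the key."""
    return bisect.bisect_right(cut_points, key)


def range_partitioned_sort(keys, num_buckets=4, sample_size=100, seed=0):
    keys = list(keys)
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    cut_points = find_cut_points(sample, num_buckets)

    # Pass 2, map side: route every key to its bucket.
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[partition(key, cut_points)].append(key)

    # Pass 2, reduce side: each bucket is sorted independently.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result
```

Because the partitioner never sends a larger key to a lower-numbered bucket, concatenating the sorted buckets in order gives a globally sorted result without a final merge step.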
How to Get the Sandbox
• Videos on the Jumpstart
- How to launch: http://youtu.be/J3Ppr9Cs9wA
- Clickstream video: http://youtu.be/OBYYFLmdCXg
• To get the Sandbox
- http://www.talend.com/contact
Step-by-Step Directions
• Completely Self-contained Demo VM Sandbox
• Key Scenarios like Clickstream Analysis
Come try the Sandbox
Hortonworks Dev Café & Talend
Talend Platform for Big Data v5.4
TALEND UNIFIED PLATFORM: Studio, Repository, Deployment, Execution, Monitoring
DATA INTEGRATION: Data Access, ETL / ELT, Version Control, Business Rules, Change Data Capture, Scheduler, Parallel Processing, High Availability
BIG DATA QUALITY: Hive Data Profiling, Drill-down to Values, DQ Portal, Monitoring, Data Stewardship, Report Design, Address Validation, Custom Analysis, M/R Parsing, Matching
BIG DATA: Hadoop 2.0, MapReduce ETL/ELT, HCatalog/metadata, Pig, Sqoop, Hive, Hadoop Job Scheduler, Google BigQuery, NoSQL Support, HDFS
RUNTIME PLATFORM (Java, Hadoop, SQL, etc.)
NonStop HBase – Making HBase Continuously Available for Enterprise Deployment
Dr. Konstantin Boudnik
WANdisco
Konstantin Boudnik – Director, Advanced Technologies, WANdisco
Brett Rudenstein – Senior Product Manager, WANdisco
WANdisco: continuous availability company
• WANdisco := Wide Area Network Distributed Computing
• We solve availability problems for enterprises. If you can’t afford 99.999%, we’ll help
• Publicly traded on the London Stock Exchange since mid-2012 (LSE:WAND)
• Apache Software Foundation sponsor; actively contributing to Hadoop, SVN, and others
• US-patented active-active replication technology
• Located on three continents
• Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability
• Subversion, Git, Hadoop HDFS, HBase at 200+ customer sites
What are we solving?
Traditionally everybody relies on backups
HA is (mostly) a glorified backup
• Redundancy of critical elements
- Standby servers
- Backup network links
- Off-site copies of critical data
- RAID mirroring
• Baseline:
- Create and synchronize replicas
- Clients switch over in case of failure
- Extra hardware lying idle, spinning “just in case”
A Typical Architecture (HDFS HA)
Backups can fail
WANdisco Active-Active Architecture
• 100% uptime with WANdisco’s patented replication technology
- Zero downtime / zero data loss
- Enables maintenance without downtime
• Automatic recovery of failed servers; automatic rebalancing as workload increases
Multi-threaded Server Software:
Multiple threads processing client requests in a loop
Server process, per request:
1. get client request, e.g. hbase put
2. acquire lock
3. make change to state (db)
4. release lock
5. send return value to client
[Diagram: threads 1, 2, and 3 each taking operations (OP) from the stream of client requests]
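The request loop on this slide can be illustrated with a small threaded sketch. It is a toy model under stated assumptions, not HBase code: the function `run_server`, the in-memory `state` dictionary standing in for the database, and the queue-based request delivery are all hypothetical.

```python
import queue
import threading


def run_server(ops, num_threads=3):
    """Process (key, value) 'put' requests with several worker threads."""
    state = {}                 # stands in for the server's state (db)
    lock = threading.Lock()
    requests = queue.Queue()
    replies = queue.Queue()

    def worker():
        while True:
            op = requests.get()            # get client request
            if op is None:                 # sentinel: no more work
                return
            key, value = op
            with lock:                     # acquire lock ... release lock
                state[key] = value         # make change to state
                size = len(state)
            replies.put((key, size))       # send return value to client

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for op in ops:
        requests.put(op)
    for _ in threads:
        requests.put(None)
    for t in threads:
        t.join()
    return state, [replies.get() for _ in range(len(ops))]
```

The lock serializes only the state change; fetching the request and sending the reply happen outside it, which is where the threads gain their concurrency.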
Ways to achieve single server redundancy
Using a TCP Connection to send data to three replicated servers (Load Balancer)
[Diagram: a client sends operations (OP) through load balancers to the server processes on server1, server2, and server3]
HBase WAL replication
• State Machine (HRegion contents, HMaster metadata, etc.) is modified first
• Modification Log (HBase WAL) is sent to a Highly Available shared storage
• Standby Server(s) read the edit log and serve as warm standby servers, ready to take over should the active server fail
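The active/standby pattern in these bullets can be modelled in a few lines. This is a conceptual sketch only: `SharedLog`, `ActiveServer`, and `StandbyServer` are hypothetical names, and an in-memory list stands in for the highly available shared storage. None of it is HBase API.

```python
class SharedLog:
    """Stand-in for highly available shared storage holding log entries."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)


class ActiveServer:
    def __init__(self, log):
        self.state = {}
        self.log = log

    def put(self, key, value):
        self.state[key] = value          # state machine is modified first
        self.log.append((key, value))    # then the edit goes to shared storage


class StandbyServer:
    def __init__(self, log):
        self.state = {}
        self.log = log
        self.applied = 0                 # number of log entries replayed so far

    def tail(self):
        """Replay any log entries not yet applied."""
        for key, value in self.log.entries[self.applied:]:
            self.state[key] = value
        self.applied = len(self.log.entries)

    def take_over(self):
        """Catch up on the log before serving as the new active server."""
        self.tail()
        return self.state
```

The standby lags the active server by whatever it has not yet replayed, so `take_over` must catch up first; that catch-up is one reason failover in this model is not instantaneous.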
[Diagram: server1, the single active server, writes WAL entries to shared storage; server2, the standby server, reads them]
HBase WAL tailing, WAL Snapshots etc.
• Only one active region server is possible
• Failover takes time
• Failover is error-prone
• RegionServer failover isn’t seamless for clients
Implementing multiple active masters with Paxos coordination (not about leader election)
Three replicated servers
[Diagram: clients send operations (OP) to server1, server2, and server3; each server process is paired with a Distributed Coordination Engine (DConE), and the engines coordinate through Paxos]
HBase Continuous Availability (multiple active masters)
HBase Single Points of Failure
• Single HBase Master
- Service interruption after Master failure
• HBase client
- Client session doesn’t fail over after a RegionServer failure
• HBase Region Server: downtime
- 30 secs ≤ MTTR ≤ 200 secs
• Region major compaction (not a failure, but…)
- (un)scheduled downtime of a region for compaction
HBase Region Server & Master Replication
NonStopRegionServer:
[Diagram: an HBase client talks to NonStopRegionServer 1 and NonStopRegionServer 2; each wraps an HRegionServer, exposes the Client Service (e.g. multi), and embeds DConE]
1. Client calls HRegionServer multi
2. NonStopRegionServer intercepts
3. NonStopRegionServer makes Paxos proposal using DConE library
4. Proposal comes back as agreement on all NonStopRegionServers
5. NonStopRegionServer calls super.multi on all nodes. State changes are recorded
6. NonStopRegionServer 1 alone sends response back to client
HMaster is similar
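The six steps above follow the replicated state machine pattern, which can be sketched as follows. This is a toy model, not WANdisco's code: the `Sequencer` class is a trivial in-process stand-in that assigns one global order to proposals and is not Paxos or DConE, and every class and method name here is hypothetical.

```python
class Sequencer:
    """Toy stand-in for a coordination engine (NOT Paxos, NOT DConE).

    It only guarantees that every replica sees the same operations in the
    same order.
    """

    def __init__(self):
        self.replicas = []
        self.next_seq = 0

    def register(self, replica):
        self.replicas.append(replica)

    def propose(self, op):
        seq = self.next_seq
        self.next_seq += 1
        # The agreement is delivered to every replica, in the same order.
        return {r.name: r.apply_agreed(seq, op) for r in self.replicas}


class Replica:
    def __init__(self, name, engine):
        self.name = name
        self.engine = engine
        self.state = {}
        self.applied = []                 # sequence numbers applied so far
        engine.register(self)

    def multi(self, puts):
        """Client-facing call: intercepted and proposed, not applied directly."""
        results = self.engine.propose(puts)
        # Only the replica that received the call answers the client.
        return results[self.name]

    def apply_agreed(self, seq, puts):
        """Apply an agreed batch of puts; runs on every replica."""
        for key, value in puts:
            self.state[key] = value
        self.applied.append(seq)
        return len(puts)
```

Any replica can accept a write, and all replicas end up with the same state because they apply the same agreed sequence; the real system reaches that agreement through a distributed protocol rather than the single in-process object used here.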
HBase RegionServer replication using WANdisco DConE
• Shared-nothing architecture
• HFiles, WALs etc. are not shared
• Replica count is tunable
• Snapshots of HFiles do not need to be created
• Messy details of WAL tailing are not necessary:
- WAL might not be needed at all (!)
• Not an eventual consistency model
• Does not serve up stale data
DEMO
Q & A
Thank you
Konstantin Boudnik
cos@wandisco.com
@c0sin
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 

Último (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 

Visual Mapping of Clickstream Data

  • 1. Visual Mapping of Clickstream Data: Introduction and Demonstration Cedric Carbone, Ciaran Dynes Talend
  • 2. 2 © Talend 2014 Visual mapping of Clickstream data: introduction and demonstration Ciaran Dynes VP Products Cedric Carbone CTO
  • 3. Agenda • Clickstream live demo • Moving from hand-code to code generation • Performance benchmark • Optimization of code generation
  • 4. Hortonworks Clickstream demo http://hortonworks.com/hadoop-tutorial/how-to-visualize-website-clickstream-data/
  • 5. Trying to get from this…
  • 6. …to this: Big Data, “pure Hadoop”. Visual design in Map/Reduce and optimize before deploying on Hadoop
  • 7. Demo overview: demo flow
      1. Load raw Omniture web log files to HDFS
          • Can discuss the ‘schema on read’ principle: it allows any data type to be easily loaded into a ‘data lake’, where it is then available for analytical processing
          • http://ibmdatamag.com/2013/05/why-is-schema-on-read-so-useful/
      2. Define a Map/Reduce process to transform the data
          • Identical skills to any graphical ETL tool
          • Look up customer and product data to enrich the results
          • Results written back to HDFS
      3. Federate the results to a visualisation tool of your choice
          • Excel
          • Analytics tools such as Tableau, QlikView, etc.
          • Google Charts
  • 8. Big Data Clickstream Analysis (diagram). Labels: Web logs; TALEND: Load to HDFS; HADOOP: HDFS, Map/Reduce, Hive; TALEND BIG DATA (Integration); TALEND: Federate to analytics; Clickstream Dashboard
  • 9. Native Map/Reduce Jobs
      • Create classic ETL patterns using native Map/Reduce
          - Only data management solution on the market to generate native Map/Reduce code
      • No need for expensive big data coding skills
      • Zero pre-installation on the Hadoop cluster
      • Hadoop is the “engine” for data processing
  • 11. PERFORMANCE OF CODE GENERATION
  • 12. MapReduce 2.0, YARN, Storm, Spark
      • YARN: ensures predictable performance & QoS for all apps
      • Enables apps to run “IN” Hadoop rather than “ON”
      • In Labs: streaming with Apache Storm
      • In Labs: mini-batch and in-memory with Apache Spark
      • Diagram (source: Hortonworks), “Applications Run Natively IN Hadoop”: HDFS2 (Redundant, Reliable Storage); YARN (Cluster Resource Management); BATCH (MapReduce); INTERACTIVE (Tez); STREAMING (Storm, Spark); GRAPH (Giraph); NoSQL (MongoDB); EVENTS (Falcon); ONLINE (HBase); OTHER (Search)
  • 13. Talend: Tap, Transform, Deliver (diagram)
      • Hadoop stack: HDFS2 (Redundant, Reliable Storage); YARN (Cluster Resource Management); BATCH (MapReduce); INTERACTIVE (Tez); STREAMING (Storm, Spark); GRAPH (Giraph); NoSQL (MongoDB); EVENTS (Falcon); ONLINE (HBase); OTHER (Search)
      • TAP (Ingestion): SQOOP, FLUME, HDFS API, HBase API, HIVE, 800+
      • TRANSFORM (Data Refinement): PROFILE, PARSE, MAP, CDC, CLEANSE, STANDARDIZE, MACHINE LEARNING, MATCH
      • DELIVER (as an API): ActiveMQ, Karaf, Camel, CXF, Kafka, Storm, Meta, Security, MDM, iPaaS, Govern, HA
  • 14. Talend Labs Benchmark Environment
      • Context: 9-node cluster, replication factor 3
          - DELL R210-II, 1 Xeon® E3 1230 v2, 4 cores, 16 GB RAM
          - Map slots: 2 slots / node
          - Reduce slots: 2 slots / node
      • Total processing capabilities:
          - 9*2 map slots: 18 maps
          - 9*2 reduce slots: 18 reduces
      • Data volume: 1, 10, 100 GB
  • 15. TPC-H Benchmark
      • The Apache Pig and Hive communities are using TPC-H benchmarks
          - https://issues.apache.org/jira/browse/PIG-2397
          - https://issues.apache.org/jira/browse/HIVE-600
      • We are currently running the same tests in our labs
          - Hand-coded Pig script vs. Talend-generated Pig code
          - Hand-coded Pig script vs. Talend-generated Map/Reduce code
          - Hive QL produced by the community vs. Hive ELT capabilities
      • Partial results already available for Pig
          - Very good results
  • 16. Optimizing Job configuration?
      • By default, Talend follows Hadoop recommendations regarding the number of reducers usable for job execution.
      • The rule is that 99% of the total reducers available can be used
          - http://wiki.apache.org/hadoop/HowManyMapsAndReduces
          - For the Talend benchmark, the default max reducers is:
              • 3 nodes: 5 (3*2 = 6; 6 * 99% = 5.94, rounded down to 5)
              • 6 nodes: 11 (6*2 = 12; 12 * 99% = 11.88, rounded down to 11)
              • 9 nodes: 17 (9*2 = 18; 18 * 99% = 17.82, rounded down to 17)
          - Another customer benchmark, default max reducers:
              • 700 * 99% = 693 nodes (assumption with half Dell and half HP servers)
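
The arithmetic on this slide can be reproduced with a minimal sketch. The function name `default_max_reducers` and its parameters are illustrative assumptions, not Talend or Hadoop APIs; the 99% figure and the rounding down are taken from the slide's own numbers.

```python
def default_max_reducers(total_reduce_slots, usable_percent=99):
    """Return the reducer cap for a cluster, per the slide's 99% rule.

    Hypothetical helper: integer arithmetic is used so the result is
    rounded down exactly, with no floating-point surprises.
    """
    return total_reduce_slots * usable_percent // 100


def reduce_slots(nodes, slots_per_node=2):
    """Total reduce slots, assuming the benchmark's 2 reduce slots per node."""
    return nodes * slots_per_node


if __name__ == "__main__":
    for nodes in (3, 6, 9):
        slots = reduce_slots(nodes)
        print(nodes, "nodes:", slots, "slots,", default_max_reducers(slots), "reducers")
```

Running the loop over 3, 6 and 9 nodes gives the 5, 11 and 17 reducers quoted on the slide.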
  • 17. TPC-H Results: hand-coded Pig vs. generated Pig • 19 tests with results similar to or better than hand-coded Pig scripts
  • 18. TPC-H Results: hand-coded Pig vs. generated Pig • 19 tests with results similar to or better than hand-coded Pig scripts • Code is already optimized and automatically applied • Talend code is faster
  • 20. TPC-H Results: hand-coded Pig vs. generated Pig • 19 tests with results similar to or better than hand-coded Pig scripts • 3 tests will benefit from a new COGROUP feature (chart callout: Requires CoGroup)
  • 21. Example: How Sort works for Hadoop
      Talend has implemented the TeraSort algorithm for Hadoop
      1. A first Map/Reduce job is generated to analyze the data ranges
          - Each mapper reads its data and analyzes the critical values of its buckets
          - The reducer produces quartile files for all the data to sort
      2. A second Map/Reduce job is started
          - Each mapper simply sends the key to sort to the reducer
          - A custom partitioner sends the data to the best bucket, depending on the quartile file previously created
          - Each reducer outputs its data sorted, bucket by bucket
      • Research: tSort: GraySort, MinuteSort
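
The two-pass idea described on this slide can be illustrated with a small single-process sketch. This is not Talend's generated code or Hadoop's TeraSort implementation; the function names, the sampling strategy and the bucket count are assumptions chosen for illustration. Pass 1 samples the keys to pick boundary values, and pass 2 routes each key to a bucket by range and sorts each bucket independently, so concatenating the buckets yields a globally sorted result.

```python
import bisect
import random


def compute_split_points(keys, num_buckets, sample_size=1000, seed=0):
    """Pass 1 (assumed form): sample keys and pick num_buckets - 1 boundaries."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    return [sample[len(sample) * i // num_buckets] for i in range(1, num_buckets)]


def partition(key, split_points):
    """Custom partitioner: index of the bucket whose key range contains key."""
    return bisect.bisect_right(split_points, key)


def range_partitioned_sort(keys, num_buckets=4):
    """Pass 2: route keys to buckets by range, then sort each bucket."""
    split_points = compute_split_points(keys, num_buckets)
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:  # the "map" side: send each key to its bucket
        buckets[partition(key, split_points)].append(key)
    sorted_buckets = [sorted(bucket) for bucket in buckets]  # one "reducer" per bucket
    # Buckets cover disjoint, increasing key ranges, so concatenation is sorted.
    return [key for bucket in sorted_buckets for key in bucket]
```

Because every key in bucket i is no larger than any key in bucket i+1, the reducers never need to exchange data after partitioning, which is what makes the approach parallel-friendly.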
  • 22. How to get the Sandbox
      • Videos on the Jumpstart
          - How to launch: http://youtu.be/J3Ppr9Cs9wA
          - Clickstream video: http://youtu.be/OBYYFLmdCXg
      • To get the Sandbox
          - http://www.talend.com/contact
  • 23. Step-by-Step Directions • Completely self-contained demo VM Sandbox • Key scenarios like Clickstream Analysis
  • 24. Come try the Sandbox: Hortonworks Dev Café & Talend
  • 25. Talend Platform for Big Data v5.4 (diagram)
      • TALEND UNIFIED PLATFORM: Studio, Repository, Deployment, Execution, Monitoring
      • DATA INTEGRATION: Data Access, ETL / ELT, Version Control, Business Rules, Change Data Capture, Scheduler, Parallel Processing, High Availability
      • Big DATA QUALITY: Hive Data Profiling, Drill-down to Values, DQ Portal, Monitoring, Data Stewardship, Report Design, Address Validation, Custom Analysis, M/R Parsing, Matching
      • BIG DATA: Hadoop 2.0, MapReduce ETL/ELT, HCatalog / meta-data, Pig, Sqoop, Hive, Hadoop Job Scheduler, Google Big Query, NoSQL Support, HDFS
      • RUNTIME PLATFORM (Java, Hadoop, SQL, etc.)
  • 27. NonStop HBase – Making HBase Continuously Available for Enterprise Deployment Dr. Konstantin Boudnik WANdisco
  • 28. Non-Stop HBase Making HBase Continuously Available for Enterprise Deployment Konstantin Boudnik – Director, Advanced Technologies, WANdisco Brett Rudenstein – Senior Product Manager, WANdisco
  • 29. WANdisco: continuous availability company
      • WANdisco := Wide Area Network Distributed Computing
      • We solve availability problems for enterprises. If you can’t afford 99.999%, we’ll help
      • Publicly traded on the London Stock Exchange since mid-2012 (LSE:WAND)
      • Apache Software Foundation sponsor; actively contributing to Hadoop, SVN, and others
      • US-patented active-active replication technology
      • Located on three continents
      • Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability
      • Subversion, Git, Hadoop HDFS, HBase at 200+ customer sites
  • 30. What are we solving?
  • 32. HA is (mostly) a glorified backup
      • Redundancy of critical elements
          - Standby servers
          - Backup network links
          - Off-site copies of critical data
          - RAID mirroring
      • Baseline:
          - Create and synchronize replicas
          - Clients switch over in case of failure
          - Extra hardware lying idle, spinning “just in case”
  • 35. WANdisco Active-Active Architecture
      • 100% uptime with WANdisco’s patented replication technology
          - Zero downtime / zero data loss
          - Enables maintenance without downtime
      • Automatic recovery of failed servers; automatic rebalancing as workload increases
  • 36. Multi-threaded Server Software: multiple threads processing client requests in a loop (diagram). Labels for each thread (thread 1, thread 2, thread 3) in the server process: get client request (e.g. hbase put); acquire lock; make change to state (db); release lock; send return value to client
  • 37. Ways to achieve single server redundancy
  • 38. Using a TCP connection to send data to three replicated servers (Load Balancer). Diagram: a client sends OPs through load balancers to the server processes on server1, server2 and server3
  • 39. HBase WAL replication
      • State machine (HRegion contents, HMaster metadata, etc.) is modified first
      • Modification log (HBase WAL) is sent to highly available shared storage
      • Standby server(s) read the edits log and serve as warm standbys, ready to take over should the active server fail
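
The active/standby pattern on this slide can be sketched as a toy model. None of the classes below are HBase APIs; `SharedLog`, `ActiveServer` and `StandbyServer` are hypothetical names, and an in-memory list stands in for the highly available shared storage. The sketch only shows the ordering the slide describes: state is modified first, the edit is then logged, and the standby replays the log to catch up.

```python
class SharedLog:
    """Stand-in for shared storage holding write-ahead-log entries."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)

    def read_from(self, offset):
        return self.entries[offset:]


class ActiveServer:
    """The single active server: applies each edit, then logs it."""

    def __init__(self, log):
        self.state = {}
        self.log = log

    def put(self, key, value):
        self.state[key] = value        # state machine is modified first
        self.log.append((key, value))  # then the edit goes to shared storage


class StandbyServer:
    """Warm standby: replays log entries it has not applied yet."""

    def __init__(self, log):
        self.state = {}
        self.log = log
        self.applied = 0  # number of log entries already replayed

    def catch_up(self):
        for key, value in self.log.read_from(self.applied):
            self.state[key] = value
            self.applied += 1
```

Until `catch_up()` runs, the standby lags the active server, which is one reason failover in this design takes time.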
  • 40. HBase WAL replication (diagram): a single active server (server1) writes WAL entries to shared storage; a standby server (server2) reads them
  • 41. HBase WAL tailing, WAL Snapshots etc.
      • Only one active region server is possible
      • Failover takes time
      • Failover is error-prone
      • RegionServer failover isn’t seamless for clients
  • 42. Implementing multiple active masters with Paxos coordination (not about leader election)
  • 43. Three replicated servers (diagram): clients send OPs to the server processes on server1, server2 and server3; each server process has a Distributed Coordination Engine, and the engines coordinate via Paxos (DConE)
  • 45. HBase Single Points of Failure
      • Single HBase Master
          - Service interruption after Master failure
      • HBase client
          - Client session doesn’t fail over after a RegionServer failure
      • HBase RegionServer: downtime
          - 30 secs ≤ MTTR ≤ 200 secs
      • Region major compaction (not a failure, but…)
          - (un)scheduled downtime of a region for compaction
  • 46. HBase Region Server & Master Replication
  • 51. NonStopRegionServer (diagram): an HBase client talks to NonStopRegionServer 1 and NonStopRegionServer 2; each contains a Client Service (e.g. multi), DConE and an HRegionServer
      1. Client calls HRegionServer multi
      2. NonStopRegionServer intercepts
      3. NonStopRegionServer makes a Paxos proposal using the DConE library
      4. Proposal comes back as an agreement on all NonStopRegionServers
      5. NonStopRegionServer calls super.multi on all nodes; state changes are recorded
      6. NonStopRegionServer 1 alone sends the response back to the client
      HMaster is similar
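
The six steps above can be traced in a toy, single-process sketch. It is not WANdisco's DConE or real HBase code: `ToyCoordinationEngine` is a hypothetical stand-in that simply hands every proposal to all replicas in one global order (no actual Paxos rounds), and `RegionServer.multi` is a simplified assumption of a batch write. What the sketch does show is the interception pattern: the subclass turns the client call into a proposal, every replica applies the agreed operation through the parent implementation, and only the replica that received the call answers the client.

```python
class ToyCoordinationEngine:
    """Hypothetical stand-in for DConE: one global order, delivered to all replicas."""

    def __init__(self):
        self.replicas = []
        self.next_seq = 0

    def register(self, replica):
        self.replicas.append(replica)

    def propose(self, proposer, ops):
        seq = self.next_seq  # agreement number shared by every replica
        self.next_seq += 1
        response = None
        for replica in self.replicas:  # step 4: agreement reaches all replicas
            result = replica.on_agreement(seq, ops)
            if replica is proposer:    # step 6: only the proposer replies
                response = result
        return response


class RegionServer:
    """Simplified stand-in for HRegionServer with a batch write."""

    def __init__(self):
        self.store = {}

    def multi(self, ops):
        for key, value in ops:
            self.store[key] = value
        return len(ops)


class NonStopRegionServer(RegionServer):
    def __init__(self, name, engine):
        super().__init__()
        self.name = name
        self.engine = engine
        self.applied = []  # sequence numbers applied, in order
        engine.register(self)

    def multi(self, ops):
        # Steps 2-3: intercept the client call and propose it instead of applying it.
        return self.engine.propose(self, ops)

    def on_agreement(self, seq, ops):
        # Step 5: every replica applies the agreed operation via the parent class.
        self.applied.append(seq)
        return super().multi(ops)
```

A short usage example: create one engine, register two servers with it, and send writes to either one; both stores end up identical because both applied the same operations in the same order.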
  • 52. HBase RegionServer replication using WANdisco DConE
      • Shared-nothing architecture
      • HFiles, WALs etc. are not shared
      • Replica count is tuned
      • Snapshots of HFiles do not need to be created
      • Messy details of WAL tailing are not necessary:
          - WAL might not be needed at all (!)
      • Not an eventual consistency model
      • Does not serve up stale data