Mais conteúdo relacionado Semelhante a WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL your Data with Camel and Talend (20) WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL your Data with Camel and Talend2. Kai Wähner
Main Tasks
Requirements Engineering
Enterprise Architecture Management
Business Process Management
Architecture and Development of Applications
Service-oriented Architecture
Integration of Legacy Applications
Cloud Computing
Big Data
Consulting
Developing
Coaching
Speaking
Writing
© Talend 2013
Contact
Email: kontakt@kai-waehner.de
Blog: www.kai-waehner.de/blog
Twitter: @KaiWaehner
Social Networks: Xing, LinkedIn
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
3. Key messages
You have to care about big data to be competitive in the future!
You have to integrate different sources to get most value out of it!
Big data integration is no (longer) rocket science!
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
4. Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integration with an open source framework
• Integration with an open source suite
• Custom big data connectors
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
5. Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integration with an open source framework
• Integration with an open source suite
• Custom big data connectors
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
6. Why should you care about big data?
“If you can't measure it,
you can't manage it.”
William Edwards Deming
(1900 –1993)
American statistician, professor,
author, lecturer and consultant
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
7. Why should you care about big data?
„Silence the HiPPOs“ (highest-paid person‘s opinion)
Being able to interpret unimaginable large data
stream, the gut feeling is no longer justified!
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
8. What is big data? The Vs of big data
Volume
(terabytes,
petabytes)
Velocity
(realtime or nearrealtime)
Variety
(social networks,
blog posts, logs,
sensors, etc.)
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Value
9. Big data tasks to solve - before analysis
Big Data Integration
– Land data in a Big Data cluster
– Implement or generate parallel processes
Big Data Manipulation
– Simplify manipulation, such as sort and filter
– Computational expensive functions
Big Data Quality & Governance
– Identify linkages and duplicates, validate big data
– Match connector, execute basic quality features
Big Data Project Management
– Place frameworks around big data projects
– Common Repository, scheduling, monitoring
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
10. Use case: Clickstream Analysis
“The advantage of their new system is that they can now look at their data
[from their log processing system] in anyway they want:
➜ Nightly MapReduce jobs collect statistics about their mail system such as
spam counts by domain, bytes transferred and number of logins.
➜ When they wanted to find out which part of the world their customers
logged in from, a quick [ad hoc] MapReduce job was created and they had
the answer within a few hours. Not really possible in your typical ETL
system.”
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
11. Use case: Risk management
Deduce
Customer
Defections
http://hkotadia.com/archives/5021
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
12. Use case: Flexible pricing
➜
➜
➜
➜
With revenue of almost USD 30 billion and a network of
800 locations, Macy's is considered the largest store operator in the
USA
Daily price check analysis of its 10,000 articles in less than two hours
Whenever a neighboring competitor anywhere between New York
and Los Angeles goes for aggressive price reductions, Macy's follows
its example
If there is no market competitor, the prices remain unchanged
http://www.t-systems.com/about-t-systems/examples-of-successes-companies-analyze-big-data-in-record-time-l-t-systems/1029702
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
13. Storage: Compliance
Global Parcel Service
A lot of data must be stored „forever“
➜ Numbers increase exponentially
➜ Goal: As cheap as possible
➜ Problem: (Fast) queries must still be possible
➜ Solution: Commodity servers and „Hadoop querying“
➜
http://archive.org/stream/BigDataImPraxiseinsatz-SzenarienBeispieleEffekte/Big_Data_BITKOM-Leitfaden_Sept.2012#page/n0/mode/2up
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
14. Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integration with an open source framework
• Integration with an open source suite
• Custom big data connectors
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
15. Limited big data experts
This is your
company
Big Data Geek
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
16. Big data tool selection (business perspective)
Wanna buy a big data solution for your industry?
➜ Maybe a competitor has a big data solution which
adds business value?
➜ The competitor will never publish it (rat-race)!
➜
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
17. Big data tool selection (technical perspective)
Looking for ‚your‘ required big data product?
Support your data from scratch?
Good luck!
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
18. Data quality
Big Data + Poor Data Quality = Big Problems
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
19. How to solve these big data challenges?
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
20. What is your Big Data process?
Step 1
Step 2
Step 3
1) Do not begin with the data, think about business opportunities
2) Choose the right data (combine different data sources)
3) Use easy tooling
http://hbr.org/2012/10/making-advanced-analytics-work-for-you
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
21. Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integration with an open source framework
• Integration with an open source suite
• Custom big data connectors
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
23. How to process big data?
The critical flaw in parallel ETL tools is the fact that the data is almost never local to the processing
nodes. This means that every time a large job is run, the data has to first be read from the source,
split N ways and then delivered to the individual nodes. Worse, if the partition key of the source
doesn’t match the partition key of the target, data has to be constantly exchanged among the
nodes. In essence, parallel ETL treats the network as if it were a physical I/O subsystem. The
network, which is always the slowest part of the process, becomes the weakest link in the
performance chain.
http://blog.syncsort.com/2012/08/parallel-etl-tools-are-dead
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
24. How to process big data?
Slides: http://www.slideshare.net/pavlobaron/100-big-data-0-hadoop-0-java
Video: http://www.infoq.com/presentations/Big-Data-Hadoop-Java
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
25. How to process big data?
The defacto standard for big data processing
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
26. How to process big data?
“A big part of [the
company’s strategy]
includes wiring SQL Server
2012 (formerly known by
the codename “Denali”) to
the Hadoop distributed
computing platform, and
bringing Hadoop to
Windows Server and Azure”
Even Microsoft (the .NET house) relies on Hadoop since 2011
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
27. What is Hadoop?
Apache Hadoop, an open-source software library, is a
framework that allows for the distributed processing of
large data sets across clusters of commodity hardware
using simple programming models. It is designed to scale
up from single servers to thousands of machines, each
offering local computation and storage.
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
28. Map (Shuffle) Reduce
Simple example
• Input: (very large) text files with lists of strings, such as:
„318, 0043012650999991949032412004...0500001N9+01111+99999999999...“
• We are interested just in some content: year and temperate (marked in red)
• The Map Reduce function has to compute the maximum temperature for every year
Example from the book “Hadoop: The Definitive Guide, 3rd Edition”
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
29. How to process big data?
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
30. Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integration with an open source framework
• Integration with an open source suite
• Custom big data connectors
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
31. Alternatives for systems integration
Integration Suite
Enterprise
Service Bus
Integration
Framework
Low
Connectivity
Routing
Transformation
© Talend 2013
High
+
INTEGRATION
Tooling
Monitoring
Support
+
BUSINESS PROCESS MGT.
BIG DATA / MDM
REGISTRY / REPOSITORY
RULES ENGINE
„YOU NAME IT“
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Complexity
of Integration
32. Alternatives for systems integration
Integration
Framework
Enterprise
Service Bus
Low
© Talend 2013
Integration Suite
High
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Complexity
of Integration
33. More details about integration frameworks...
http://www.kai-waehner.de/blog/2012/12/20/showdown-integration-frameworkspring-integration-apache-camel-vs-enterprise-service-bus-esb/
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
34. Enterprise Integration Patterns (EIP)
Apache Camel
Implements the EIPs
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
38. Choose your required connectors
TCP
SQL
SMTP
Netty
RMI
FTP
Lucene
JMX
MQ
Atom
LDAP
Quartz
XSLT
Akka
Many many more
© Talend 2013
IRC
Log
HTTP
File
EJB
AMQP
RSS
AWS-S3
Jetty
JDBC
Bean-Validation
JMS
CXF
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Custom Components
39. Choose your favorite DSL
XML
(not production-ready yet)
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
40. Deploy it wherever you need
Standalone
Application Server
Web Container
Spring Container
OSGi
Cloud
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
41. Enterprise-ready
• Open Source
• Scalability
• Error Handling
• Transaction
• Monitoring
• Tooling
• Commercial Support
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
43. Hadoop Integration with Apache Camel
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
44. camel-hdfs connector (Java DSL)
// Producer
from("ftp://user@myServer?password=secret")
.to(“hdfs:///myDirectory/myFile.txt?append=true");
// Consumer
from(“hdfs:///myDirectory/myBigDataAnalysis.csv")
.to(“file:target/reports/report.csv");
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
46. Camel and Hadoop File Formats
Choose the appropriate Hadoop file format, e.g. SequenceFile, MapFile, etc...
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
47. camel-avro connector
Camel is not just about connectors ...
Camel supports a pluggable DataFormat to allow messages to be marshalled to and
unmarshalled from binary or text formats, e.g. CSV, JSON, SOAP, EDI, ZIP, or ...
... Avro, a data serialization system used in Apache Hadoop. camel-avro connector
provides a dataformat for Avro, which allows serialization and deserialization of
messages using Apache Avro's binary data format. Moreover, it provides support for
Apache Avro's RPC, by providing producers and consumers endpoint for using Avro
over Netty or HTTP.
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
48. Real World Use Case: Storage: Compliance
Global Parcel Service
A lot of data must be stored „forever“
➜ Numbers increase exponentially
➜ Goal: As cheap as possible
➜ Problem: (Fast) queries must still be possible
➜ Solution: Commodity servers and „Hadoop querying“
➜
http://archive.org/stream/BigDataImPraxiseinsatz-SzenarienBeispieleEffekte/Big_Data_BITKOM-Leitfaden_Sept.2012#page/n0/mode/2up
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
49. Real world use case: Storage: Compliance
Orders
(Server 1)
Payments
(Server 2)
ETL
Log Files
(Server 3)
Log Files
(Server 100)
© Talend 2013
Storage
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Query
50. Live demo
Apache Camel in action...
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
51. camel-pig? camel-hive? camel-hcatalog?
Not available yet (current Camel version: 2.12) Workarounds:
•
•
•
Use Pig / Hive-Query scripts (via camel-exec connector or any
scripting language)
Build your own connector (more details later ...)
Use Hive-Hbase-Integration and store data in HBase ( „ugly“)
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
52. Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integration with an open source framework
• Integration with an open source suite
• Custom big data connectors
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
53. Alternatives for systems integration
Integration Suite
Enterprise
Service Bus
Integration
Framework
Low
Connectivity
Routing
Transformation
© Talend 2013
High
+
INTEGRATION
Tooling
Monitoring
Support
+
BUSINESS PROCESS MGT.
BIG DATA / MDM
REGISTRY / REPOSITORY
RULES ENGINE
„YOU NAME IT“
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Complexity
of Integration
54. Alternatives for systems integration
Integration
Framework
Enterprise
Service Bus
Low
© Talend 2013
Integration Suite
High
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Complexity
of Integration
55. More details about ESBs and suites...
http://www.kai-waehner.de/blog/2013/01/23/spoilt-for-choicehow-to-choose-the-right-enterprise-service-bus-esb/
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
56. Hadoop Integration with Talend Open Studio
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
57. Vision: Democratize big data
Talend Open Studio for Big Data
•
•
…an open source
ecosystem
© Talend 2013
Generates Hadoop code and run transforms
inside Hadoop
•
Native support for HDFS, Pig, Hbase, Hcatalog,
Sqoop and Hive
•
Pig
Improves efficiency of big data job design with
graphic interface
100% open source under an Apache License
•
Standards based
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
58. Vision: Democratize big data
Talend Platform for Big Data
•
Builds on Talend Open Studio for Big Data
•
Adds data quality, advanced scalability and
management functions
•
•
© Talend 2013
Data cleansing
•
•
Data quality and profiling
•
…an open source
ecosystem
Shared Repository and remote deployment
•
Pig
MapReduce massively parallel data
processing
Reporting and dashboards
Commercial support, warranty/IP indemnity
under a subscription license
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
59. Talend Open Studio for Big Data
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
60. Real world Use case: Clickstream Analysis
“The advantage of their new system is that they can now look at their data
[from their log processing system] in anyway they want:
➜ Nightly MapReduce jobs collect statistics about their mail system such as
spam counts by domain, bytes transferred and number of logins.
➜ When they wanted to find out which part of the world their customers
logged in from, a quick [ad hoc] MapReduce job was created and they had
the answer within a few hours. Not really possible in your typical ETL
system.”
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
62. Potential Uses of Clickstream Data
One of the original uses of Hadoop at Yahoo was to store and process their massive volume of
clickstream data. Now enterprises of all types can use Hadoop to refine and analyze
clickstream data. They can then answer business questions such as:
•
•
•
What is the most efficient path for a site visitor to research a product, and then buy it?
What products do visitors tend to buy together, and what are they most likely to buy in
the future?
Where should I spend resources on fixing or enhancing the user experience on my
website?
Goal: Data visualization can help you optimize your website and
convert more visits into sales and revenue.
Source: for Clickstream Example: „Hortonworks Hadoop Tutorials - Real Life Use Cases”
http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
63. Real world use case: Clickstream Analysis
Log Files
(Server 1)
ETL
Log Files
(Server 2)
Log Files
(Server 100)
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
64. Example: ETL Job
„... using Talend’s HDFS and Hive Components”
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
65. Example: ETL Job
„... using Talend’s Map Reduce Components*”
* Not available in open source version of Talend Studio
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
66. Example: Analysis with Microsoft Excel
We can see that the largest number of page hits in Florida were for
clothing, followed by shoes.
Source: for Clickstream Example: „Hortonworks Hadoop Tutorials - Real Life Use Cases”
http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
67. Example: Analysis with Microsoft Excel
The chart shows that the majority of men shopping for clothing on our
website are between the ages of 22 and 30. With this information, we can
optimize our content for this market segment.
Source: for Clickstream Example: „Hortonworks Hadoop Tutorials - Real Life Use Cases”
http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
68. Example: Analysis with Tableau
Spoilt for Choice Use your preferred BI or Analysis tool!
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
69. Live demo
„Talend Open Studio for Big Data“ in action...
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
70. Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integration with an open source framework
• Integration with an open source suite
• Custom big data connectors
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
71. Custom connectors
Easy to realize for all
integration alternatives *
•
Integration Framework
•
Enterprise Service Bus
•
Integration Suite
* At least for open source solutions
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
72. Custom connectors
You might need a ...
•
... Hive connector for Camel
•
... Impala connector for Talend
•
... custom connector for your
internal data format
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
73. Live demo (Example: Apache Camel)
Custom connectors in action...
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
74. Did you get the key message?
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
75. Key messages
You have to care about big data to be competitive in the future!
You have to integrate different sources to get most value out of it!
Big data integration is no (longer) rocket science!
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
76. Did you get the key message?
© Talend 2013
“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
77. Thank you for your attention.
Questions?
Kai Wähner| Talend
@KaiWaehner
www.kai-waehner.de/blog
Xing / LinkedIn