Scaling ETL with Hadoop - Avoiding Failure

1. Scaling ETL with Hadoop
   Gwen Shapira, @gwenshap, gshapira@cloudera.com
2. Coming soon to a bookstore near you…
   • Hadoop Application Architectures: How to build end-to-end solutions using Apache Hadoop and related tools
   • @hadooparchbook
   • www.hadooparchitecturebook.com
3. ETL is…
   • Extracting data from outside sources
   • Transforming it to fit operational needs
   • Loading it into the end target
   • (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
4. Hadoop Is…
   • HDFS – massive, redundant data storage
   • MapReduce – batch-oriented data processing at scale
   • Many, many ways to process data in parallel at scale
5. The Ecosystem
   • High-level languages and abstractions
   • File, relational, and streaming data integration
   • Process orchestration and scheduling
   • Libraries for data wrangling
   • Low-latency query languages
6. Why ETL with Hadoop?
7. Data Has Changed in the Last 30 Years
   [Infographic: data growth driven by end-user applications, the Internet, mobile devices, and sophisticated machines. In 1980 data was mostly structured; by 2013 roughly 90% of data was unstructured and only 10% structured.]
8. Volume, Variety, Velocity Cause Problems
   [Diagram: OLTP and enterprise applications feed extract/transform/load into a data warehouse, which is queried by business intelligence tools. Pain points: (1) slow data transformations and missed SLAs; (2) slow queries, frustrated business and IT; (3) forced archiving, and archived data can't provide value.]
9. Got unstructured data?
   • Traditional ETL: Text, CSV, XLS, XML
   • Hadoop: HTML; XML, RSS; JSON; Apache logs; Avro, Protobuf, ORC, Parquet; compression; Office, OpenDocument, iWork; PDF, EPUB, RTF; MIDI, MP3; JPEG, TIFF; Java classes; Mbox, RFC 822; AutoCAD; TrueType; HDF / NetCDF
10. What is Apache Hadoop?
    Apache Hadoop is an open source platform for data storage and processing that is distributed, fault tolerant, and scalable.
    • Has the flexibility to store and mine any type of data: ask questions across structured and unstructured data that were previously impossible to ask or solve; not bound by a single schema
    • Excels at processing complex data: a scale-out architecture divides workloads across multiple nodes; a flexible file system eliminates ETL bottlenecks
    • Scales economically: can be deployed on commodity hardware; the open source platform guards against vendor lock-in
    Core Hadoop system components: HDFS (self-healing, high-bandwidth clustered storage) and MapReduce (distributed computing framework).
11. What I often see: an ETL cluster, ELT in the DWH, or ETL in Hadoop
12. [Image slide]
13. Best Practices
    Arup Nanda taught me to ask:
    1. Why is it better than the rest?
    2. What happens if it is not followed?
    3. When are they not applicable?
14. [Image slide]
15. Extract
16. Let me count the ways:
    1. From databases: Sqoop
    2. Log data: Flume
    3. Copy data to HDFS
17. Data Loading Mistake #1
    "Hadoop is scalable. Let's run as many Sqoop mappers as possible, to get the data from our DB faster!" — Famous last words
18. Result: [image slide]
19. Lesson:
    • Start with 2 mappers, add slowly
    • Watch DB load and network utilization
    • Use the FairScheduler to limit the number of mappers (see the sketch below)
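A minimal Sqoop sketch of that lesson; the connection string, credentials, table, and target directory are hypothetical. Start with two mappers and add more only while watching database load and the network.

    # import with a deliberately small number of parallel mappers
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username etl_user -P \
      --table ORDERS \
      --num-mappers 2 \
      --target-dir /data/sales/orders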
20. Data Loading Mistake #2
    "Database-specific connectors are complicated and scary. Let's just use the default JDBC connector." — Famous last words
21. Result: [image slide]
22. Lesson:
    1. There are connectors for Oracle, Netezza, and Teradata
    2. Download them
    3. Read the documentation
    4. Ask questions if it is not clear
    5. Follow the installation instructions
    6. Use Sqoop with the connectors (see the sketch below)
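A hedged sketch of step 6, assuming the Oracle direct connector has already been installed per its documentation; connection details are made up. Sqoop's --direct flag switches from the generic JDBC code path to the installed direct connector.

    # use the vendor connector instead of generic JDBC
    sqoop import --direct \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username etl_user -P \
      --table ORDERS \
      --target-dir /data/sales/orders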
23. Data Loading Mistake #3
    "Just copying files? This sounds too simple. We probably need some cool whizzbang tool." — Famous last words
24. Result: [image slide]
25. Lessons:
    • Copying files is a legitimate solution
    • In general, simple is good
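The "simple is good" option really is just a copy; the paths here are examples.

    # create a landing directory and copy a file into HDFS
    hdfs dfs -mkdir -p /data/sales/orders/incoming
    hdfs dfs -put /staging/orders_20131101.csv /data/sales/orders/incoming/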
26. Transform
27. Endless possibilities:
    • MapReduce
    • Crunch / Cascading
    • Spark
    • Hive (i.e., SQL)
    • Pig
    • R
    • Shell scripts
    • Plain old Java
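As one example from that menu, a Hive (SQL) transformation run from the shell; the table and column names are invented for illustration.

    # cleanse a raw table into a curated one with plain SQL
    hive -e "
      INSERT OVERWRITE TABLE orders_clean
      SELECT order_id, upper(customer), cast(amount AS DOUBLE)
      FROM orders_raw
      WHERE amount IS NOT NULL;"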
28. Data Processing Mistake #0 [image slide]
29. Data Processing Mistake #1
    "This system must be ready in 12 months. We have to convert 100 data sources and 5,000 transformations to Hadoop. Let's spend 2 days planning a schedule and budget for the entire year and then just go and implement it. Prototype? Who needs that?" — Famous last words
30. Result: [image slide]
31. Lessons:
    • Take the learning curve into account
    • You don't know what you don't know
    • Hadoop will be difficult and frustrating for at least 3 months
32. Data Processing Mistake #2
    "Hadoop is all about MapReduce. So I'll use MapReduce for all my data processing needs." — Famous last words
33. Result: [image slide]
34. Lessons:
    MapReduce is the assembly language of Hadoop: simple things are hard, hard things are possible.
35. Data Processing Mistake #3
    "I've got 5,000 tiny XMLs, and Hadoop is great at processing unstructured data. So I'll just leave the data like that and parse the XML in every job." — Famous last words
36. Result: [image slide]
37. Lessons:
    1. Consolidate small files
    2. Don't argue about #1
    3. Convert files to easy-to-query formats (see the sketch below)
    4. De-normalize
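A sketch of lessons 1 and 3 in one step, assuming the raw XML has already been parsed into a Hive table (all names are hypothetical): a CREATE TABLE AS SELECT rewrites thousands of tiny files into a few large, columnar ones.

    # consolidate small files and convert to a columnar format
    # (STORED AS PARQUET assumes Hive 0.13 or later)
    hive -e "
      CREATE TABLE orders_parquet STORED AS PARQUET
      AS SELECT * FROM orders_raw;"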
38. Data Processing Mistake #4
    "Partitions are for relational databases." — Famous last words
39. Result: [image slide]
40. Lessons:
    1. Without partitions, every query is a full table scan
    2. Yes, Hadoop scans fast
    3. But the fastest read is the one you don't perform
    4. Cheap storage allows you to store the same dataset, partitioned multiple ways
    5. Use partitions for fast data loading (see the sketch below)
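A minimal partitioning sketch in Hive; the table and column names are made up. A filter on the partition column prunes the scan to one directory instead of the whole table.

    # one directory per day; queries that filter on order_date
    # read only the matching partitions
    hive -e "
      CREATE TABLE orders_by_day (order_id BIGINT, amount DOUBLE)
      PARTITIONED BY (order_date STRING)
      STORED AS PARQUET;"
    hive -e "SELECT count(*) FROM orders_by_day WHERE order_date = '2013-11-01';"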
41. Load
42. Technologies:
    • Sqoop
    • Fuse-DFS
    • Oracle connectors
    • Just copy files
    • Query Hadoop
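For the Sqoop route back into a database, a hedged sketch; the connection string, target table, and export directory are hypothetical.

    # push a summarized HDFS dataset into a relational table
    sqoop export \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username etl_user -P \
      --table ORDERS_SUMMARY \
      --export-dir /data/sales/orders_summary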
43. Data Loading Mistake #1
    "All of the data must end up in a relational DWH." — Famous last words
44. Result: [image slide]
45. Lessons:
    • Use relational:
      • To maintain tool compatibility
      • For DWH enrichment
    • Stay in Hadoop for:
      • Text search
      • Graph analysis
      • Reduced time in the pipeline
      • Big data and a small network
      • A congested database
46. Data Loading Mistake #2
    "We used Sqoop to get data out of Oracle. Let's use Sqoop to get it back in." — Famous last words
47. Result: [image slide]
48. Lesson:
    Use the Oracle direct connectors if you can afford them. They:
    1. Are faster than any alternative
    2. Use Hadoop to make Oracle more efficient
    3. Make *you* more efficient
49. Workflow Management
50. Tools:
    • Oozie
    • Pentaho, Talend, ActiveBatch, AutoSys, Informatica, UC4, Cron
51. Workflow Mistake #1
    "Workflow management is easy. I'll just write a few scripts." — Famous last words
52. [Quote slide] — Josh Wills
53. Lesson:
    A workflow management tool should enable:
    • Keeping track of metadata, components, and integrations
    • Scheduling and orchestration
    • Restarts and retries
    • A cohesive system view
    • Instrumentation, measurement, and monitoring
    • Reporting
54. Workflow Mistake #2
    "Schema? This is Hadoop. Why would we need a schema?" — Famous last words
55. Result: [image slide]
56. Lesson: use a consistent directory layout, for example:
    /user/…
    /user/gshapira/testdata/orders
    /data/<database>/<table>/<partition>
    /data/<biz unit>/<app>/<dataset>/partition
    /data/pharmacy/fraud/orders/date=20131101
    /etl/<biz unit>/<app>/<dataset>/<stage>
    /etl/pharmacy/fraud/orders/validated
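A small sketch of putting that convention into practice, reusing the slide's own example paths. The Hive-style key=value leaf directory is what later lets tools register it as a partition.

    # staging area for one pipeline stage, and a dated data directory
    hdfs dfs -mkdir -p /etl/pharmacy/fraud/orders/validated
    hdfs dfs -mkdir -p /data/pharmacy/fraud/orders/date=20131101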
57. Workflow Mistake #3
    "Oozie was written for Hadoop, so the right solution will always use Oozie." — Famous last words
58. Result: [image slide]
59. Lessons:
    • Oozie has advantages
    • Use the tool that works for you
60. Hue + Oozie [image slide]
61. [Quote slide] — Neil Gaiman
62. [Quote slide] — Esther Dyson
63. [Image slide]
64. Should DBAs learn Hadoop?
    • Hadoop projects are more visible
    • 48% of Hadoop clusters are owned by the DWH team
    • Big Data == the business pays attention to data
    • New skills, from coding to cluster administration
    • Interesting projects
    • No, you don't need to learn Java
65. Beginner Projects
    • Take a class
    • Download a VM
    • Install a 5-node Hadoop cluster in AWS
    • Load data: the complete works of Shakespeare, the MovieLens dataset
    • Find the 10 most common words in Shakespeare (see the sketch below)
    • Find the 10 most recommended movies
    • Run TPC-H
    • Cloudera Data Science Challenge
    • An actual use case: XML ingestion, ETL process, DWH history
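For the Shakespeare exercise, a hedged sketch using Hive from the shell (the file location and table name are made up); the classic explode(split(...)) idiom does the word counting.

    # load the text, then count words and keep the top 10
    hdfs dfs -put shakespeare.txt /user/gshapira/shakespeare.txt
    hive -e "
      CREATE TABLE shakespeare (line STRING);
      LOAD DATA INPATH '/user/gshapira/shakespeare.txt' INTO TABLE shakespeare;
      SELECT word, count(*) AS cnt
      FROM (SELECT explode(split(line, ' ')) AS word FROM shakespeare) w
      GROUP BY word
      ORDER BY cnt DESC
      LIMIT 10;"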
66. Books [image slide]
67. More Books [image slide]