Content and talk by Friso van Vollenhoven (GoDataDriven)
Let's face it: data warehousing in the traditional sense has been tedious, slow, and lacking in agility. Often designed and built around a static set of up-front questions, a traditional data warehouse is basically a big database tailored to the job of populating dashboards. The fact that the volume of data we need to deal with has recently exploded by several orders of magnitude isn't improving the situation. It's no surprise that we see a new class of data warehousing setups emerge, using big data technologies and NoSQL stores. A nice side effect is that these solutions are usually not only the domain of the BI crowd, but can also be developer friendly and allow the development of more data driven apps.
In this talk I will present experiences using Hadoop and other tools from the Hadoop ecosystem, such as Hive, Pig and bare MapReduce, to handle data that grows by tens of GBs per day. We created a system where data is captured, stored and made available to different users and use cases, ranging from end users who write SQL queries to software developers who access the underlying data to create data driven products. I will cover topics like ETL, querying, development, deployment and reporting; using a fully open source stack, of course.
Building a Big Data Data Warehouse - GOTO 2013
1. GoDataDriven
PROUDLY PART OF THE XEBIA GROUP
@fzk
frisovanvollenhoven@godatadriven.com
Building a Big Data DWH
Friso van Vollenhoven
CTO
Data Warehousing on Hadoop
2. “In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis.” -- Wikipedia
8. How to:
• Add a column to the facts table?
• Change the granularity of dates from day to hour?
• Add a dimension based on some aggregation of facts?
10. Schemas are designed with questions in mind. Changing a schema requires redoing the ETL.
Push things to the facts level.
Keep all source data available at all times.
12. And now?
• MPP databases?
• Faster / better / more SAN?
• (RAC?)
22. • No JVM startup overhead for Hadoop API usage
• Relatively concise syntax (Python)
• Mix Python standard library with any Java libs (see the sketch below)
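Because Jython runs inside the JVM, a script can call the Hadoop Java API directly instead of shelling out to the hadoop CLI, which pays JVM startup cost on every invocation. Below is a minimal Jython sketch of that idea; it assumes the Hadoop jars are on the classpath, and the /data/incoming path is made up.

# Jython: import Hadoop's Java classes as if they were Python modules.
from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.fs import FileSystem, Path

conf = Configuration()      # picks up core-site.xml / hdfs-site.xml from the classpath
fs = FileSystem.get(conf)

# Plain Python logic mixed with the Java API: list a (made-up) delivery directory.
for status in fs.listStatus(Path('/data/incoming')):
    print('%s %d bytes' % (status.getPath().getName(), status.getLen()))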
24. • Flexible scheduling with dependencies (see the sketch after this list)
• Saves output
• E-mails on errors
• Scales to multiple nodes
• REST API
• Status monitor
• Integrates with version control
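The slides don't name the scheduler, but the feature list boils down to dependency-aware job execution. The sketch below illustrates only the dependency-ordering core, with invented job names and commands; it is not the actual tool's API.

# Hypothetical job graph: each job declares the jobs it depends on.
JOBS = {
    'import_logs':    {'command': 'python import_logs.py', 'depends_on': []},
    'parse_logs':     {'command': 'python parse_logs.py',  'depends_on': ['import_logs']},
    'load_into_hive': {'command': 'hive -f load.q',        'depends_on': ['parse_logs']},
}

def run_order(jobs):
    # Topological sort: every job comes after all of its dependencies.
    done, order = set(), []
    def visit(name):
        if name in done:
            return
        for dep in jobs[name]['depends_on']:
            visit(dep)
        done.add(name)
        order.append(name)
    for name in jobs:
        visit(name)
    return order

print(run_order(JOBS))   # ['import_logs', 'parse_logs', 'load_into_hive']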
42. or -2, in Hive
-- Hive tries to parse a delimiter string as a number first, so '-2'
-- means the single byte 0xFE (254), not the two characters "-2".
CREATE TABLE browsers (
browser_id STRING,
browser STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-2';
49. • The format will change
• Faulty deliveries will occur
• Your parser will break
• Records will be mistakenly produced (over-logging)
• Other people test in production too (and you get the data from it)
• Etc., etc.
50. • Simple deployment of ETL code
• Scheduling
• Scalable
• Independent jobs
• Fixable data store
• Incremental where possible
• Metrics
52. Out of order jobs
• At any point, you don’t really know what ‘made it’ to Hive
• Will happen anyway, because some days the data delivery is going to be three hours late
• Or you get half in the morning and the other half later in the day
• It really depends on what you do with the data
• This is where metrics + fixable data store help...
53. Fixable data store
• Using Hive partitions
• Jobs that move data from staging create partitions
• When new data / insight about the data arrives, drop the partition and re-insert (sketched below)
• Be careful to reset any metrics in this case
• Basically: instead of trying to make everything transactional, repair afterwards
• Use metrics to determine whether data is fit for purpose
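A minimal sketch of the drop-and-re-insert repair, driving the hive CLI from Python; the table names and columns (pageviews, staging_pageviews, user_id, url, ts) are invented for illustration.

import subprocess

def rebuild_partition(day):
    # Replace one day's partition from staging after a bad or late delivery.
    # Table and column names are hypothetical.
    hql = """
        ALTER TABLE pageviews DROP IF EXISTS PARTITION (dt='{day}');
        INSERT OVERWRITE TABLE pageviews PARTITION (dt='{day}')
        SELECT user_id, url, ts FROM staging_pageviews WHERE dt='{day}';
    """.format(day=day)
    subprocess.check_call(['hive', '-e', hql])
    # As the slide warns: also reset any metrics recorded for this partition.

rebuild_partition('2013-06-01')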
55. Metrics service
• Job ran, so many units processed, took so much time (reporting sketched below)
• e.g. 10GB imported, took 1 hr
• e.g. 60M records transformed, took 10 minutes
• Dropped partition
• Inserted X records into partition
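What reporting to such a metrics service might look like; the endpoint URL and field names are hypothetical, since the slides don't show the service's API (Python 2 / Jython-era stdlib).

import json, time, urllib2

def report(job, units, unit_type, seconds):
    record = {
        'job': job,                  # e.g. 'import_pageviews'
        'units': units,              # how many units were processed
        'unit_type': unit_type,      # 'bytes', 'records', ...
        'duration_seconds': seconds,
        'timestamp': int(time.time()),
    }
    # Hypothetical endpoint; substitute whatever the metrics service exposes.
    req = urllib2.Request('http://metrics.example.com/api/metrics',
                          json.dumps(record),
                          {'Content-Type': 'application/json'})
    urllib2.urlopen(req)             # POSTs, because a request body is supplied

# "10GB imported, took 1 hr"
report('import_pageviews', 10 * 2**30, 'bytes', 3600)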