Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017


  1. Data Lake Architecture. A guide by Artur Fejklowicz
  2. Whoami I work at TVN S.A., the biggest Polish commercial TV broadcaster. We have a broad variety of TV channels (linear TV) like TVN, TVN24, HGTV, TVN Style and TVN Turbo. We produce great content, which we share on our VoD platform player.pl (non-linear TV). www.tvn24.pl is our news portal, where people can watch a live stream of the TVN24 channel and read breaking news. As a Big Data Solutions Architect I lead the Data Engineers team. We support the Data Lake and make data available to business users. We also build data processing pipelines. I am a Cloudera Certified Administrator for Apache Hadoop. https://www.linkedin.com/in/arturr/
  3. How to start with the Data Lake Tip: Dig a great hole and ask friends to dive into the lake with you.
  4. Business clients Write down who/what will use the Data Lake. It can be Data Scientists, Data Analysts or applications. This will affect SLAs and security levels. They will take responsibility for extracting information from the data and for its proper interpretation.
  5. Data sources inventory Check sizing The daily size will tell you how much storage you will need for the data. HDFS stores 3 copies by default, but data can be compressed. Format What is the current data format (CSV, a log with space-separated fields, multiline, a stream format such as Avro or JSON)? Interfaces External/internal, firewall issues, security, encoding. Ingest frequency How often do the business users want the data to be available to them? For analytics it is usually 1 day. For online tracking it might be 1 minute, 10 s or even less than 1 s.
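The sizing advice above (daily size × 3-way HDFS replication, reduced by compression) can be sketched as a back-of-the-envelope calculation. This is a minimal illustration, not tooling from the deck; the function name, the retention parameter and the 0.4 compression factor (roughly matching the "about 40%" Snappy figure on a later slide) are assumptions.

```python
def hdfs_storage_estimate(daily_gb, retention_days, replication=3, compression_ratio=0.4):
    """Rough raw HDFS capacity needed for one data source, in GB.

    daily_gb          - uncompressed daily ingest size
    retention_days    - how long the data is kept
    replication       - HDFS stores 3 copies by default
    compression_ratio - compressed/uncompressed size fraction (assumed ~0.4)
    """
    return daily_gb * retention_days * replication * compression_ratio

# e.g. 100 GB/day of logs kept for a year:
print(round(hdfs_storage_estimate(100, 365)))  # 43800 (GB of raw capacity)
```

Running the inventory with such a formula per source makes it easy to see which sources dominate the storage budget.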
  6. Teams definitions Data Scientists Responsible for asking the right questions against the data, with properly built/chosen algorithms to find the right answers. Need cross-domain skills/experience to see the “big picture”. Usually should know: machine learning, programming in Java/Scala/Python, R, SQL; should also know statistical analysis and conceptual modeling. "A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician." Josh Wills, former Director of Data Science at Cloudera. Data Analysts Help Data Scientists in their jobs. Perform analysis on data sets from a particular system. Should know data mining and how to prepare summary reports, and should be highly skilled with Business Intelligence tools like Tableau or Qlik. “Junior Data Scientist”. Data Engineers Build data processing systems on Hadoop, using self-written applications in Java/Scala/Python for Spark, or software like Hive, Sqoop, Kafka, Flume or NiFi. They decide about hardware and software needs. They keep the Data Platform up and running.
  7. Technology Hadoop distribution Cloudera, Hortonworks, MapR, self-built, cloud, other. Services to install Depends on business user needs. For example, if you plan to run complex processing pipelines, install Luigi or Jenkins instead of Oozie. A smaller number of services is better for a fast start with the Data Lake and for later support. BI Tools Depend on business user needs.
  8. Security Securing your Data Lake is always a good idea, but it will delay the start of the Data Lake, so you need to balance when to enable security. The earlier you secure your Hadoop Data Lake, the fewer problems it will create, because the Data Lake will grow fast in the number of users, data sources, ETLs and services. By default Hadoop trusts that the user presented at login is that user. Securing Hadoop means at least enabling Kerberos authentication (often with an LDAP/AD backend). A full AAA security model with active auditing is very time-consuming to implement and might need commercial support. Authentication needs Kerberos. Authorization needs Sentry or Ranger. Auditing can be passive (good for a start) or active.
  9. Infrastructure Usually it is better to have more smaller servers than fewer strong ones - we want to parallelize the computations. You might need strong boxes (with a great amount of RAM) for services like Impala or Spark that do in-memory processing. It is always a good idea to have 4x 1 Gbps network ports as an LACP bond, if your LAN switches support this. It is also a good idea to plan HDFS tiering or to send “cold data” to cheap cloud storage. Put worker nodes (HDFS DataNodes + YARN NodeManagers) on bare-metal machines, though virtualizing the masters is something to consider.
  10. Data Lake architecture Data engineering
  11. Data Lake environments Minimum 2 environments (RC and Prod); best is to have a minimum of 3. An environment is the Hadoop cluster + BI tools + ETL jobs and any other services you will change/implement in Production. 1. Testing (R&D) - You can test new services and new functionalities there. It can be shut down at any time. Users can have their own environments for their own tests. It might be totally virtualized. 2. Release Candidate (RC) - Has the same configuration as Production, but with minimal hardware resources (can be virtualized). For testing software upgrades and configuration changes. For example, when you implement Kerberos authentication this is a “must have” environment. Only selected users who need to prepare a production change have access. 3. Production - Business users must obey the rules in order to keep the Data Lake services’ SLAs.
  12. Storage file formats 1. Row (easily human-readable, slow) a. Used in RDBMS for SUID (select/update/insert/delete) queries that work on many columns at a time b. Not so good for aggregations, because you need to do a full table scan c. Not so good for compression, because neighboring values within one row have various data types 1,1,search,122.303.444,t_{0},/products/search/filters=tshirts,US; --row 1 2,1,view,122.303.444,t_{1},/products/id=11212,US; --row 2 ...; 7,9,checkout,121.322.432,t_{2},/checkout/cart=478/status=confirmed,UK; --row 7 2. Columnar (very hard for a human to read, fast) a. Values from a particular column are stored as neighbors of each other. b. Very good for aggregations, because you only have to fetch the blocks of the needed columns, without a full table scan c. Very good for compression, because values of the same type sit next to each other (e.g. run-length encoding for integers, dictionary encoding for strings). 1,2,3,4,5,6,7; 1,3,9; search,view,add_to_cart,checkout; ...; US,FR,UK;
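The two encodings named above for columnar layouts can be shown in a few lines. This is a minimal, self-contained sketch of the general techniques (not the actual ORC/Parquet implementations); the function names and the sample event data are illustrative assumptions.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# A column of country codes, laid out as a columnar file would store it:
country_column = ["US"] * 4 + ["FR"] * 2 + ["UK"]
print(run_length_encode(country_column))  # [('US', 4), ('FR', 2), ('UK', 1)]

def dictionary_encode(values):
    """Store each distinct string once, plus a small integer code per row."""
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return dictionary, [dictionary[v] for v in values]

d, codes = dictionary_encode(["search", "view", "view", "checkout", "view"])
print(codes)  # [0, 1, 1, 2, 1]
```

Both tricks only pay off when same-typed values are adjacent, which is exactly why the columnar layout compresses so much better than the row layout.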
  13. Storage file formats - TEXTFILE TEXTFILE Row format. Very good for a start. Good for storing CSV/TSV. CREATE EXTERNAL TABLE `mydb.peoples_transactions` ( `id` INT COMMENT 'Personal id', `name` STRING COMMENT 'Name of a person', `day` STRING COMMENT 'Transaction date' ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/externals/mydb/peoples_transactions';
  14. Storage file formats - Avro - Self-describing - the schema can be part of the data file. - Supports column aliases for schema evolution. - The schema can be defined in a file (SERDEPROPERTIES) or in the table definition (TBLPROPERTIES). - Easy transformation from/to JSON. CREATE TEMPORARY EXTERNAL TABLE temporary_myavro ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.url'='/avro_schemas/myavro.avsc') STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/externals/myavro'; Alternatively, embed the schema in the table definition: TBLPROPERTIES ('avro.schema.literal'='{ "namespace": "testing.hive.avro.serde", "name": "peoples_transactions", "type": "record", "fields": [ { "name":"id", "type":"int", "doc":"Personal id" }, { "name":"name", "type":"string", "doc":"Name of a person" }, { "name":"day", "type":"string", "doc":"Transaction date" } ], "doc":"Table with transactions" }')
  15. Storage file formats - Parquet Parquet The most widely used columnar file format. Supported by Cloudera’s Impala in-memory engine, Hive and Spark. Has basic statistics - the number of elements in a column stored in a particular row group. Very good for Spark when using Tungsten.
  16. Storage file formats - ORC ORC - Has 3 index levels (file, stripe and 10k rows). - Files can be even 78% smaller. - Basic statistics: min, max, sum, count per column, per stripe and per file. - When inserting into a table, try to sort the data by the most used column. - Supports predicate pushdown.
  17. Storage file formats - orcfiledump hive --orcfiledump /user/hive/warehouse/mydb.db/mytable/day=2017-04-29/000000_0 … Column 115: count: 5000 hasNull: false min: max: 0 sum: 2128 Column 116: count: 5000 hasNull: false min: max: 992 sum: 7811 Stripe 10: Column 0: count: 5000 hasNull: false Column 1: count: 0 hasNull: true sum: 0 … File Statistics: Column 0: count: 545034 hasNull: false Column 1: count: 0 hasNull: true sum: 0 … Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 … Encoding column 115: DICTIONARY_V2[1] Encoding column 116: DICTIONARY_V2[1]
  18. Storage file formats - example ORC write and load into a DataFrame import org.apache.spark.sql._ val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) case class Contact(name: String, phone: String) case class Person(name: String, age: Int, contacts: Seq[Contact]) val records = (1 to 100).map { i => Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") }) } sc.parallelize(records).toDF().write.format("orc").save("people") val people = sqlContext.read.format("orc").load("people") people.registerTempTable("people") Predicate pushdown example sqlContext.setConf("spark.sql.orc.filterPushdown", "true") sqlContext.sql("SELECT name FROM people WHERE age < 15").count()
  19. Storage file formats - compression None - fast data access, but large file sizes and large network bandwidth. Snappy - written by Google. Very fast, but not splittable - decoding will be done on one CPU core. Compression level about 40%. Usually the best choice. Gzip - Good compression ratio. Slower than Snappy. High CPU usage.
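The trade-off above is easy to observe with gzip, which ships in the Python standard library (Snappy does not, so this sketch uses gzip only; the sample payload and the exact ratio are illustrative, not the deck's ~40% Snappy figure, which applies to more mixed data).

```python
import gzip

# Column-like, highly repetitive log data compresses extremely well.
payload = ("US,search,2017-04-29\n" * 10_000).encode()
compressed = gzip.compress(payload)

ratio = len(compressed) / len(payload)
print(f"compressed to {ratio:.1%} of the original size")
```

On real, less repetitive data the ratio is much worse, which is why it is worth measuring compression on a representative sample during the data-sources inventory.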
  20. Batch ingestion There are various methods to do batch ingestion, depending on the source of the data. Sqoop is used mostly to import from RDBMS. Direct file upload to Hive is also possible. There are also other tools, like NiFi, to monitor and orchestrate the ingestion.
  21. Batch ingestion - sqoop - Very good for copying tables from an RDBMS into HDFS files or Hive tables. - The number of output files can be steered by the number of mappers used. - Connects using JDBC drivers for the particular database, but not only. - By default stores data in text files; SequenceFiles and Avro are also supported. - Supports HCatalog - useful for import into other storage formats. - Supports incremental import based on last-value (append and lastmodified). - You can specify a query to be imported. - Supports Gzip compression (not enabled by default) and other algorithms. - Care must be taken with many exceptions when importing, for example: - Fields in the database can contain newline characters; this is a problem when importing into Hive, where the table row delimiter is also '\n' - Nulls by default are imported as the string 'null', while Hive uses '\N' ( --null-string for string columns and --null-non-string for non-string columns, with escaping )
  22. Batch ingestion - sqoop into ORC sqoop import -Dmapreduce.job.queuename=your.yarn.scheduler.queuename --connect jdbc:mysql://myserver:3306/mydatabase --username srvuser --password-file file:///path/to/.sqoop.pw.file -m 1 --null-string '\\N' --null-non-string '\\N' --table mytable --hcatalog-database myhivedb --hcatalog-table mytable --hcatalog-partition-values "`date +%F`" --hcatalog-partition-keys 'day' --hcatalog-storage-stanza 'STORED AS ORCFILE'
  23. Batch ingestion - Hive 1. Import to an external table a. Copy files into HDFS, into the external table location directory, using hdfs dfs -put. b. Partitioning is a good practice. Usually by date, for example: /externals/myparttab/y=2017/m=04/d=29/ After uploading a file to a new partition you need to create this partition in the Hive metastore with the MSCK REPAIR TABLE command. 2. Import to optimized storage SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict; INSERT OVERWRITE TABLE mydb.mytable PARTITION(`day`) SELECT `id`, `name`, `day` FROM mydb.mytable_external;
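Ingestion scripts that drive the hdfs dfs -put step above typically build the y=/m=/d= partition directory from the current date. A minimal sketch of that helper (the function name and base path are assumptions, not tooling from the deck):

```python
from datetime import date

def partition_path(base, d):
    """Build the y=/m=/d= Hive partition directory for a given date."""
    return f"{base}/y={d:%Y}/m={d:%m}/d={d:%d}"

print(partition_path("/externals/myparttab", date(2017, 4, 29)))
# /externals/myparttab/y=2017/m=04/d=29
```

After uploading into a freshly created directory like this, MSCK REPAIR TABLE (or an explicit ALTER TABLE ... ADD PARTITION) registers the new partition in the metastore.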
  24. Batch ingestion - NiFi A powerful tool for controlling and monitoring the data flow. In the GUI you build a graph of configurable processors and their relationships. You can change data formats on the fly (ex. JSON into Avro). Has many processors for HDFS, Hive, Kafka, Flume, JDBC interfaces and many others. Might be used for both batch (interval- or cron-based) and stream processing.
  25. Streaming ingestion 1. Stream source (ex. a farm of web servers or connected devices) 2. Message broker (ex. Kafka) 3. Stream transport component (Flume, NiFi) 4. Stream processing engine (ex. Spark, Storm, Flink) 5. Message format (JSON, AVRO, TEXT, compression) [Diagram: typical streaming ingestion configuration - the data source feeds Flume agents (source + Kafka channel) in round-robin; the Kafka cluster holds Partition1/ReplicaP1 and Partition2/ReplicaP2; a Flume agent with a Kafka channel and HDFS sink writes to HDFS, exposed as a Hive external table and consumed by Spark, Flink, Storm...]
  26. Data export Hive export to comma/tab/delimiter-separated values formats. CONNSTRING="jdbc:hive2://my.hive.com:10000/;principal=hive/my.hive.com@MYREALM.COM?mapreduce.job.queuename=my.queue" DAY=`date +%F` /usr/bin/beeline -u ${CONNSTRING} --outputformat=tsv2 --showHeader=false --hivevar DAY=$DAY -e "SELECT * FROM mydb.mytable WHERE day='${hivevar:DAY}'" > mytable.${DAY}.tsv For DSV: --delimiterForDSV='ł'
  27. Data export - Spark A complex export can be run as a Spark or MapReduce application. For example, the easiest way to export to an RDBMS from Spark is a direct write from a DataFrame: val prop = new java.util.Properties val jurl = "jdbc:sqlserver://my.sql.com:1433;databaseName=mydb" val rdbmsTab = "mytab" def main(args: Array[String]): Unit = { prop.setProperty("user", "myuser") prop.setProperty("password", "XXX") prop.setProperty("driver", "rdbms.jdbc.driver") val sc = new SparkContext() val sqlContext = new HiveContext(sc) val myDF = sqlContext.sql(""" SELECT country, count(id), day FROM mydb.mytab WHERE day < from_unixtime(unix_timestamp(),'yyyy-MM-dd') GROUP BY day, country """) myDF.write.mode("overwrite").jdbc(jurl, rdbmsTab, prop) } If you need to make UPDATEs you need to use the plain JDBC DriverManager, because DataFrames can only write in "error", "append", "overwrite" and "ignore" modes.
  28. Machine Learning model lifecycle Lifecycle based on Spark 1. A Data Scientist trains the model and saves it as PMML or a Spark ML Pipeline 2. Depending on the need, the ML model can be used by a Data Engineer in order to: a. Once a day recalculate data and export it, for example to an RDBMS. b. Load the saved model and expose it, for example via a REST API, or update an in-memory store like Druid. c. Use Oryx for online model upgrades. 3. Data Scientists must have access to measure the model’s effectiveness.
  29. Oryx architecture
  30. Spark considerations - DataFrames are better than RDDs (collections of Java objects) - don’t cache RDDs, because of serialization cost. - When using Spark Streaming it is better to log an error once per batch than to throw it 100k times. - Kafka’s best message size is 10k - 100k. Even 2 rows in one transaction are better than one.
  31. Environments for DA and DS - Hue - the primary tool for Analysts and Data Scientists. - Beeline for accessing Hive from the CLI. - JDBC for connecting to Hive tables from Excel. - Self-service BI tools (Tableau, Qlik, etc.). - Jupyter - a notebook for Data Scientists. Can run Scala/Python/R in one flow.
  32. What’s next - Data Lake 3.0 (Hortonworks) - Application assembly - run multiple services in dockerized containers on YARN. Each can have its own environment. - Auto-tiering - for automatic data movement between tiers. - Network and IO level isolation. - Cloudera Data Science Workbench - A collaborative platform for Data Scientists. - Generally available since June 2017. - Spark 2 - SparkContext and HiveContext are rebuilt into SparkSession. - Adds the spark-csv library; in Spark 1.x you had to use the library from Databricks manually. - Global temporary views are available to other sessions.
  33. Thank you