Data Lake Architecture
A guide by Artur Fejklowicz
Whoami
I work at TVN S.A., the biggest Polish commercial TV broadcaster. We
have a broad variety of TV channels (linear TV) such as TVN, TVN24, HGTV, TVN Style and
TVN Turbo. We produce great content, which we share on our VoD platform
player.pl (non-linear TV). www.tvn24.pl is our news portal, where people can watch
the live stream of the TVN24 channel and read breaking news.
As a BigData Solutions Architect I lead the Data Engineers team.
We support the Data Lake and make data available to business users.
We also build data processing pipelines.
I am Cloudera Certified Administrator for Apache Hadoop.
https://www.linkedin.com/in/arturr/
How to start with
the Data Lake
Tip
Dig a great hole and ask
friends to dive into the
lake with you.
Business clients
Write down who/what will use the Data Lake.
It can be Data Scientists, Data Analysts or
applications.
This will affect SLAs and security levels.
They will take responsibility for extracting
information from the data and interpreting it
properly.
Data sources inventory
Check sizing
The daily size will tell you how much storage you will need for the data. HDFS stores 3 copies by
default, but data can be compressed (see the sketch below).
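As a back-of-the-envelope sizing sketch (every number below is an illustrative assumption, not a recommendation):
// Rough HDFS capacity estimate; all figures are illustrative assumptions.
val dailyRawGB       = 100.0  // raw daily ingest
val compressionRatio = 0.4    // e.g. Snappy on text-like data
val replication      = 3.0    // HDFS default replication factor
val retentionDays    = 365
val neededTB = dailyRawGB * compressionRatio * replication * retentionDays / 1024
println(f"~$neededTB%.1f TB of raw HDFS capacity needed") // ~42.8 TB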
Format
What is the current data format (CSV, a log with space-separated fields, multiline, a stream format
such as AVRO or JSON)?
Interfaces
External/internal, firewall issues, security, encoding
Ingest frequency
How often do the business users want the data to be available to them?
For analytics it is usually 1 day. For online tracking it might be 1 minute, 10 s or even less than 1 s.
Teams definitions
Data Scientists
Responsible for asking the right questions of the data and for building/choosing the right algorithms to find the answers. Need to have
cross-domain skills/experience to see the “big picture”. Usually should know machine learning, programming in Java/Scala/Python, R, SQL,
statistical analysis and conceptual modeling.
“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”
Josh Wills, former Director of Data Science at Cloudera
Data Analysts
Help Data Scientists in their jobs. Perform analysis on data sets from a particular system. Should know data mining and how to prepare
summary reports, and should be highly skilled in Business Intelligence tools like Tableau or Qlik.
A “junior Data Scientist”.
Data Engineers
Build data processing systems on Hadoop, using self-written Java/Scala/Python applications for Spark or software like Hive, Sqoop,
Kafka, Flume or NiFi. They decide on hardware and software needs. They keep the Data Platform up and running.
Technology
Hadoop distribution
Cloudera, Hortonworks, MapR, self-built, cloud, other.
Services to install
Depends on business user needs. For example, if you plan to use complex processing pipelines, install Luigi or Jenkins
instead of Oozie. Fewer services means a faster start with the Data Lake and easier support later.
BI Tools
Depends on business user needs.
Security
Securing your Data Lake is always a good idea, but it will delay the start of the Data Lake, so you need to balance when to enable
security.
The earlier you secure your Hadoop Data Lake the fewer problems it will create, because the Data Lake will grow fast in
number of users, data sources, ETLs and services.
By default Hadoop trusts that the user presented at login really is that user.
Securing Hadoop means, at a minimum, enabling Kerberos authentication (often with an LDAP/AD backend).
A full AAA security model with active auditing is very time-consuming to implement and might need commercial support.
Authentication needs Kerberos.
Authorization needs Sentry or Ranger.
Auditing can be passive (good for a start) or active.
Infrastructure
Usually it is better to have more smaller servers than fewer strong ones - we want to parallelize the computations.
You might need strong boxes (with a great amount of RAM) for services like Impala or Spark that do in-memory
processing.
It is always a good idea to have 4x 1 Gbps network ports as an LACP bond, if your LAN switches support this.
It is also a good idea to plan HDFS tiering or to send “cold data” to cheap cloud storage.
Put worker nodes (HDFS DataNodes + YARN NodeManagers) on bare-metal machines, though virtualizing the masters is
something to consider.
Data Lake architecture
Data engineering
Data Lake environments
Minimum 2 environments (RC and Prod); having at least 3 is best.
An environment is a Hadoop cluster + BI tools + ETL jobs and any other services you will change/implement in Production.
1. Testing (R&D) - you can test new services and new functionalities there. It can be shut down at any time. Users can have
their own environments for their own tests. It might be fully virtualized.
2. Release Candidate (RC) - has the same configuration as Production, but with minimal hardware resources (can be
virtualized). Used for testing software upgrades and configuration changes. For example, when you implement Kerberos
authentication this is a “must have” environment. Only selected users who need to prepare a production
change have access.
3. Production - business users must obey the rules in order to keep the Data Lake services’ SLAs.
Storage file formats
1. Row (easily human-readable, slow)
a. Used in RDBMS for SIUD (select/insert/update/delete) queries that work on many columns at a time
b. Not so good for aggregations, because you need to do a full table scan
c. Not so good for compression, because neighboring values within a row have various data types
1,1,search,122.303.444,t_{0},/products/search/filters=tshirts,US; --row 1
2,1,view,122.303.444,t_{1},/products/id=11212,US; --row 2
...;
7,9,checkout,121.322.432,t_{2},/checkout/cart=478/status=confirmed,UK; --row 7
2. Columnar (very hard for a human to read, fast)
a. Values from a particular column are stored as neighbors of each other.
b. Very good for aggregations, because you only fetch the blocks of the needed columns, without a full table scan
c. Very good for compression, because values of the same type sit next to each other (ex. run-length encoding for integers, dictionary encoding for strings).
1,2,3,4,5,6,7;
1,3,9;
Search,view,add_to_cart,checkout;
...;
US,FR,UK;
Storage file formats - TEXTFILE
TEXTFILE
Row format. Very good for a start. Good for storing CSV/TSV.
CREATE EXTERNAL TABLE `mydb.peoples_transactions` (
`id` INT COMMENT 'Personal id',
`name` STRING COMMENT 'Name of a person',
`day` STRING COMMENT 'Transaction date' )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/externals/mydb/peoples_transactions';
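Once the external table exists it can be queried like any other table; a minimal sketch from Spark (Spark 1.x HiveContext, as used elsewhere in these slides):
// Query the external text table from Spark.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val tx = sqlContext.sql("SELECT `name`, count(*) AS n FROM mydb.peoples_transactions GROUP BY `name`")
tx.show()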
Storage file formats - Avro
- Self-describing - the schema can be part of the data file.
- Supports column aliases for schema evolution.
- The schema can be referenced from a file (SERDEPROPERTIES) or inlined in the table
definition (TBLPROPERTIES).
- Easy transformation from/to JSON.
CREATE TEMPORARY EXTERNAL TABLE temporary_myavro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES
('avro.schema.url'='/avro_schemas/myavro.avsc')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/externals/myavro';
-- Alternatively, inline the schema in the table definition:
TBLPROPERTIES ('avro.schema.literal'='{
"namespace": "testing.hive.avro.serde",
"name": "peoples_transactions",
"type": "record",
"fields": [
{ "name":"id", "type":"int",
"doc":"Personal id" },
{ "name":"name", "type":"string",
"doc":"Name of a person" },
{ "name":"day", "type":"string",
"doc":"Transaction date" }
],
"doc":"Table with transactions"
}')
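To read the same Avro files directly from Spark 1.x, one option is the Databricks spark-avro package (an assumption here - it is not part of core Spark and must be added to the classpath):
// Requires the com.databricks:spark-avro package on the classpath.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df = sqlContext.read.format("com.databricks.spark.avro").load("/externals/myavro")
df.printSchema() // the schema comes from the Avro files themselves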
Storage file formats - Parquet
Parquet
The most widely used columnar file format. Supported by
Cloudera’s Impala in-memory engine, Hive and Spark. Has a basic
statistic - the number of elements in a column stored in a particular row
group. Very good for Spark when using Tungsten.
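A minimal write/read round trip in Spark 1.x (paths and table names are illustrative):
// Write a Hive query result as Parquet, then read it back.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM mydb.peoples_transactions")
df.write.format("parquet").save("/tmp/peoples_transactions_parquet")
val back = sqlContext.read.format("parquet").load("/tmp/peoples_transactions_parquet")
back.registerTempTable("tx_parquet")
sqlContext.sql("SELECT count(*) FROM tx_parquet").show()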
Storage file formats - ORC
ORC
- Has 3 index levels (file, stripe and every 10k rows).
- Files can be even 78% smaller.
- Basic statistics: min, max, sum, count per column, per stripe and per file.
- When inserting into a table, try to sort the data by the most frequently used column (see the sketch below).
- Supports predicate pushdown.
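A sketch of the sorting tip in Spark 1.x (assuming `day` is the most frequently filtered column; table and path names are illustrative). Sorting before the write keeps each stripe’s min/max statistics narrow, so predicate pushdown can skip whole stripes:
// Sort by the most-filtered column before writing ORC,
// so stripe-level min/max statistics become selective.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM mydb.mytable_external")
df.sort("day").write.format("orc").save("/tmp/mytable_orc_sorted")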
Storage file formats - orcdump
hive --service orcfiledump /user/hive/warehouse/mydb.db/mytable/day=2017-04-29/000000_0
…
Column 115: count: 5000 hasNull: false min: max: 0 sum: 2128
Column 116: count: 5000 hasNull: false min: max: 992 sum: 7811
Stripe 10:
Column 0: count: 5000 hasNull: false
Column 1: count: 0 hasNull: true sum: 0
…
File Statistics:
Column 0: count: 545034 hasNull: false
Column 1: count: 0 hasNull: true sum: 0
…
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
…
Encoding column 115: DICTIONARY_V2[1]
Encoding column 116: DICTIONARY_V2[1]
Storage file formats - example
ORC write and load into DataFrame
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._ // needed for .toDF() below
case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])
val records = (1 to 100).map { i =>
Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}
sc.parallelize(records).toDF().write.format("orc").save("people")
val people = sqlContext.read.format("orc").load("people")
people.registerTempTable("people")
Predicate pushdown example
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sqlContext.sql("SELECT name FROM peopleWHERE age < 15").count()
Storage file formats - compression
None - fast data access, but large file sizes and heavy network
bandwidth usage.
SNAPPY - written by Google. Very fast, but not splittable -
decoding will be done on one CPU core. Compression ratio about
40%. Usually the best choice.
Gzip - good compression ratio. Slower than Snappy. High CPU
usage.
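Codec choice is usually set per storage format; a sketch of the relevant settings (property names are assumptions for the Spark 1.x / Hive versions of the time):
// Parquet codec for Spark writes (Spark 1.x property):
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy") // or "gzip", "uncompressed"
// ORC compression is governed by the Hive-side setting:
sqlContext.setConf("hive.exec.orc.default.compress", "SNAPPY") // "NONE", "ZLIB", "SNAPPY"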
Batch ingestion
There are various methods of batch ingestion, depending on
the source of the data. Sqoop is used mostly to import from RDBMS.
Direct file upload to Hive is also possible. There are also other
tools, like NiFi, to monitor and orchestrate the ingestion.
Batch ingestion - sqoop
- Very good for copying tables from an RDBMS into HDFS files or Hive tables.
- The number of output files can be steered by the number of mappers used.
- Connects using JDBC drivers for the particular database, but not only.
- Stores data as text files by default; SequenceFiles and Avro are also supported.
- Supports HCatalog - useful for importing into other storage formats.
- Supports incremental import based on a last value (append and lastmodified modes).
- You can specify a query to be imported.
- Supports Gzip compression (not enabled by default) and other algorithms.
- Care must be taken with many corner cases when importing, for example:
- Fields in the database can contain newline characters; this is a problem when importing into Hive,
where the table row delimiter is also ‘\n’
- NULLs by default are imported as the string ‘null’, while Hive uses \N ( use --null-string for string columns and
--null-non-string for non-string columns, with escaping )
Batch ingestion - sqoop into ORC
sqoop import
-Dmapreduce.job.queuename=your.yarn.scheduler.queuename 
--connect jdbc:mysql://myserver:3306/mydatabase 
-username srvuser 
--password-file file:///path/to/.sqoop.pw.file 
-m 1 
--null-string "N" 
--null-non-string "N" 
--table mytable 
--hcatalog-database myhivedb 
--hcatalog-table mytable 
--hcatalog-partition-values "`date +%F`"
--hcatalog-partition-keys 'day' 
--hcatalog-storage-stanza 'STORED AS ORCFILE'
Batch ingestion - Hive
1. Import to an external table
a. Copy files into HDFS, to the external table’s location directory, using hdfs dfs -put.
b. Partitioning is a good practice. Usually by date, for example:
/externals/myparttab/y=2017/m=04/d=29/
After uploading a file to a new partition you need to create this partition in the Hive metastore with the MSCK
REPAIR TABLE command (or add it explicitly - see the ALTER TABLE sketch below).
2. Import to optimized storage
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE mydb.mytable
PARTITION(`day`)
SELECT
`id`, `name`, `day`
FROM mydb.mytable_external;
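Registering a single fresh partition explicitly is also possible; MSCK REPAIR rescans the whole table location, while ALTER TABLE adds just one partition. A sketch via Spark’s HiveContext (table name and path are illustrative):
// Register one freshly uploaded partition without rescanning the whole location.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql(
"""ALTER TABLE mydb.myparttab ADD IF NOT EXISTS
|PARTITION (y='2017', m='04', d='29')
|LOCATION '/externals/myparttab/y=2017/m=04/d=29'""".stripMargin)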
Batch ingestion - NiFi
A powerful tool for controlling and monitoring the data flow. In the GUI you
build a graph of configurable processors and their relationships. You can
change data formats on the fly (ex. JSON into AVRO). Has many
processors for HDFS, Hive, Kafka, Flume, JDBC interfaces and many
others. Can be used for both batch (interval- or cron-based) and stream
processing.
Streaming ingestion
1. Stream source (ex. a farm of web servers or connected devices)
2. Message broker (ex. Kafka)
3. Stream transport component (Flume, NiFi)
4. Stream processing engine (ex. Spark, Storm, Flink)
5. Message format (JSON, AVRO, TEXT, compression)
Typical streaming ingestion configuration
Data sources feed Flume agents (source + Kafka channel), which publish round-robin to a
Kafka cluster (Partition1 + ReplicaP1, Partition2 + ReplicaP2). A downstream Flume agent
(Kafka channel + HDFS sink) writes to HDFS, exposed as a Hive external table, while stream
processing engines (Spark, Flink, Storm...) consume the same Kafka topics.
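On the processing side, a minimal consumer sketch using the Spark 1.x direct Kafka API (broker addresses and the topic name are placeholders):
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
// Minimal direct-stream consumer (spark-streaming-kafka, Spark 1.x API).
val conf = new SparkConf().setAppName("stream-ingest")
val ssc = new StreamingContext(conf, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "kafka1:9092,kafka2:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, Set("clickstream"))
// Count messages per 10 s micro-batch; a real job would transform and persist them.
stream.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()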
Data export
Hive export to comma/tab/delimiter-separated values formats.
CONNSTRING="jdbc:hive2://my.hive.com:10000/;principal=hive/my.hive.com@MYREALM.COM?mapreduce.job.queuename=my.queue"
DAY=`date +%F`
/usr/bin/beeline -u ${CONNSTRING} --outputformat=tsv2 --showHeader=false --hivevar DAY=$DAY \
-e "SELECT * FROM mydb.mytable WHERE day='${hivevar:DAY}'" > mytable.${DAY}.tsv
For DSV output use --delimiterForDSV='ł'
Data export - Spark
Complex exports can be run as Spark or MapReduce applications. For example, the easiest way to export to an RDBMS from
Spark is a direct write from a DataFrame:
import java.util.Properties
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
object ExportToRdbms { // illustrative wrapper object, not on the original slide
val prop = new Properties
val jurl = "jdbc:sqlserver://my.sql.com:1433;databaseName=mydb"
val rdbmsTab = "mytab"
def main(args: Array[String]): Unit = {
prop.setProperty("user", "myuser")
prop.setProperty("password", "XXX")
prop.setProperty("driver", "rdbms.jdbc.driver")
val sc = new SparkContext()
val sqlContext = new HiveContext(sc)
val myDF = sqlContext.sql("""
SELECT country, count(id), day
FROM mydb.mytab
WHERE day < from_unixtime(unix_timestamp(),'yyyy-MM-dd')
GROUP BY day, country
""")
myDF.write.mode("overwrite").jdbc(jurl, rdbmsTab, prop)
}
}
If you need to make UPDATEs you have to use the plain JDBC DriverManager, because DataFrames can only write in "error",
"append", "overwrite" and "ignore" modes - see the sketch below.
Machine Learning model lifecycle
Lifecycle based on Spark
1. A Data Scientist trains the model and saves it as PMML or a Spark ML
Pipeline (a save/load sketch follows below).
2. Depending on the need, the ML model can be used by a Data
Engineer in order to:
a. Recalculate data once a day and export it, for
example, to an RDBMS.
b. Load the saved model and expose it, for example via a REST
API, or update an in-memory store like Druid.
c. Use Oryx for online model upgrades.
3. Data Scientists must have access to measure model
effectiveness.
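A minimal train/save/load sketch with the Spark ML Pipeline API (assuming Spark 2.x, where pipeline persistence is complete; table names, feature columns and the model path are illustrative):
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
// Train once (e.g. by the Data Scientist)...
val trainingDF = spark.table("mydb.training_set") // `spark` is the Spark 2.x SparkSession
val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label")
val model = new Pipeline().setStages(Array(assembler, lr)).fit(trainingDF)
model.write.overwrite().save("/models/mymodel/v1")
// ...then load and apply elsewhere (a daily scoring job, a REST service, etc.).
val loaded = PipelineModel.load("/models/mymodel/v1")
val scored = loaded.transform(spark.table("mydb.yesterday_events"))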
Oryx architecture
Spark considerations
- DataFrames are better than RDDs (collections of Java objects).
- Don’t cache needlessly, because of serialization costs.
- When using Spark Streaming it is better to log an error to a
batch than to throw it 100k times.
- Kafka’s best message size is 10k - 100k. Even 2 rows in one
transaction are better than one.
Environments for DA and DS
- Hue - primary tool for Analysts and Data Scientists.
- Beeline for accessing Hive from CLI.
- JDBC for connecting to Hive tables from Excel.
- Self-service BI tools (Tableau, Qlik, etc.).
- Jupyter - notebook for Data Scientists. Can run
Scala/Python/R in one flow.
What’s next
- DataLake 3.0 (Hortonworks)
- Application assembly - run multiple services in dockerized containers on YARN. Each can have its own
environment.
- Auto-tiering - for automatic data movement between tiers.
- Network and IO level isolation.
- Cloudera Data Science Workbench
- Collaborative platform for Data Scientists.
- Generally available since June 2017.
- Spark 2
- SparkContext and HiveContext are unified into SparkSession.
- Adds a built-in CSV reader. In Spark 1.x you had to add the spark-csv library from Databricks manually.
- Global temporary views are visible to other sessions.
Thank you
Mais conteúdo relacionado

Mais procurados

Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesDataWorks Summit
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Data lake benefits
Data lake benefitsData lake benefits
Data lake benefitsRicky Barron
 
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...NoSQLmatters
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lakepunedevscom
 
Hybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop ImplementationsHybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop ImplementationsDavid Portnoy
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHumza Naseer
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 MillionHow One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 MillionDataWorks Summit
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake ArchitectureDATAVERSITY
 
The Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedThe Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedDunn Solutions Group
 
Big Data & Data Lakes Building Blocks
Big Data & Data Lakes Building BlocksBig Data & Data Lakes Building Blocks
Big Data & Data Lakes Building BlocksAmazon Web Services
 
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platformBig Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platformCaserta
 
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data AnalyticsHow to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data AnalyticsInformatica
 
Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher   Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher Tamir Dresher
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which DataWorks Summit
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 

Mais procurados (20)

Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data Architectures
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Data lake benefits
Data lake benefitsData lake benefits
Data lake benefits
 
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
 
Hybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop ImplementationsHybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop Implementations
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 MillionHow One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
 
The Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedThe Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They Need
 
Big Data & Data Lakes Building Blocks
Big Data & Data Lakes Building BlocksBig Data & Data Lakes Building Blocks
Big Data & Data Lakes Building Blocks
 
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platformBig Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
 
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data AnalyticsHow to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
 
Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher   Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 

Semelhante a Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017

Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist SoftServe
 
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...Caserta
 
Building and deploying large scale real time news system with my sql and dist...
Building and deploying large scale real time news system with my sql and dist...Building and deploying large scale real time news system with my sql and dist...
Building and deploying large scale real time news system with my sql and dist...Tao Cheng
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopDatabricks
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraCaserta
 
Building a scalable analytics environment to support diverse workloads
Building a scalable analytics environment to support diverse workloadsBuilding a scalable analytics environment to support diverse workloads
Building a scalable analytics environment to support diverse workloadsAlluxio, Inc.
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Building an analytical platform
Building an analytical platformBuilding an analytical platform
Building an analytical platformDavid Walker
 
Sybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase Türkiye
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsAmazon Web Services
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudJaipaul Agonus
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Precisely
 

Semelhante a Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017 (20)

Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
 
Building and deploying large scale real time news system with my sql and dist...
Building and deploying large scale real time news system with my sql and dist...Building and deploying large scale real time news system with my sql and dist...
Building and deploying large scale real time news system with my sql and dist...
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
 
Building a scalable analytics environment to support diverse workloads
Building a scalable analytics environment to support diverse workloadsBuilding a scalable analytics environment to support diverse workloads
Building a scalable analytics environment to support diverse workloads
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Building an analytical platform
Building an analytical platformBuilding an analytical platform
Building an analytical platform
 
Sybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase IQ ile Analitik Platform
Sybase IQ ile Analitik Platform
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 

Mais de Lviv Startup Club

Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...
Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...
Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...Lviv Startup Club
 
Mykhailo Hryhorash: What can be good in a "bad" project? (UA)
Mykhailo Hryhorash: What can be good in a "bad" project? (UA)Mykhailo Hryhorash: What can be good in a "bad" project? (UA)
Mykhailo Hryhorash: What can be good in a "bad" project? (UA)Lviv Startup Club
 
Nikita Zahurdaiev: PMO Tools and Technologies (UA)
Nikita Zahurdaiev: PMO Tools and Technologies (UA)Nikita Zahurdaiev: PMO Tools and Technologies (UA)
Nikita Zahurdaiev: PMO Tools and Technologies (UA)Lviv Startup Club
 
Nikita Zahurdaiev: Developing PMO Services and Functions (UA)
Nikita Zahurdaiev: Developing PMO Services and Functions (UA)Nikita Zahurdaiev: Developing PMO Services and Functions (UA)
Nikita Zahurdaiev: Developing PMO Services and Functions (UA)Lviv Startup Club
 
Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)
Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)
Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)Lviv Startup Club
 
Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)
Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)
Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)Lviv Startup Club
 
Stanislav Podyachev: AI Agents as Role-Playing Business Modeling Tools (UA)
Stanislav Podyachev: AI Agents as Role-Playing Business Modeling Tools (UA)Stanislav Podyachev: AI Agents as Role-Playing Business Modeling Tools (UA)
Stanislav Podyachev: AI Agents as Role-Playing Business Modeling Tools (UA)Lviv Startup Club
 
Kyryl Truskovskyi: Training and Serving Open-Sourced Foundational Models (UA)
Kyryl Truskovskyi: Training and Serving Open-Sourced Foundational Models (UA)Kyryl Truskovskyi: Training and Serving Open-Sourced Foundational Models (UA)
Kyryl Truskovskyi: Training and Serving Open-Sourced Foundational Models (UA)Lviv Startup Club
 
Andrii Rodionov: What can go wrong in a distributed system – experience from ...
Andrii Rodionov: What can go wrong in a distributed system – experience from ...Andrii Rodionov: What can go wrong in a distributed system – experience from ...
Andrii Rodionov: What can go wrong in a distributed system – experience from ...Lviv Startup Club
 
Dmytro Tkachenko: Можливості АІ відео для бізнесу (UA)
Dmytro Tkachenko: Можливості АІ відео для бізнесу (UA)Dmytro Tkachenko: Можливості АІ відео для бізнесу (UA)
Dmytro Tkachenko: Можливості АІ відео для бізнесу (UA)Lviv Startup Club
 
Roman Kyslyi: Використання та побудова LLM агентів (UA)
Roman Kyslyi: Використання та побудова LLM агентів (UA)Roman Kyslyi: Використання та побудова LLM агентів (UA)
Roman Kyslyi: Використання та побудова LLM агентів (UA)Lviv Startup Club
 
Veronika Snizhko: Штучний інтелект як каталізатор інноваційної культури в ком...
Veronika Snizhko: Штучний інтелект як каталізатор інноваційної культури в ком...Veronika Snizhko: Штучний інтелект як каталізатор інноваційної культури в ком...
Veronika Snizhko: Штучний інтелект як каталізатор інноваційної культури в ком...Lviv Startup Club
 
Volodymyr Zhukov: Ключові труднощі в реальних імплементаціях AI. Досвід з пра...
Volodymyr Zhukov: Ключові труднощі в реальних імплементаціях AI. Досвід з пра...Volodymyr Zhukov: Ключові труднощі в реальних імплементаціях AI. Досвід з пра...
Volodymyr Zhukov: Ключові труднощі в реальних імплементаціях AI. Досвід з пра...Lviv Startup Club
 
Volodymyr Zhukov: Куди рухається ринок AI у 2024 році. Інсайти від Stanford H...
Volodymyr Zhukov: Куди рухається ринок AI у 2024 році. Інсайти від Stanford H...Volodymyr Zhukov: Куди рухається ринок AI у 2024 році. Інсайти від Stanford H...
Volodymyr Zhukov: Куди рухається ринок AI у 2024 році. Інсайти від Stanford H...Lviv Startup Club
 
Andrii Boichuk: The RAG is dead, long live the RAG або як сучасні LLM змінюют...
Andrii Boichuk: The RAG is dead, long live the RAG або як сучасні LLM змінюют...Andrii Boichuk: The RAG is dead, long live the RAG або як сучасні LLM змінюют...
Andrii Boichuk: The RAG is dead, long live the RAG або як сучасні LLM змінюют...Lviv Startup Club
 
Vladyslav Fliahin: Applications of Gen AI in CV (UA)
Vladyslav Fliahin: Applications of Gen AI in CV (UA)Vladyslav Fliahin: Applications of Gen AI in CV (UA)
Vladyslav Fliahin: Applications of Gen AI in CV (UA)Lviv Startup Club
 
Artem Ternov: Побудова платформи під DataEngineering та DataScience в ентерпр...
Artem Ternov: Побудова платформи під DataEngineering та DataScience в ентерпр...Artem Ternov: Побудова платформи під DataEngineering та DataScience в ентерпр...
Artem Ternov: Побудова платформи під DataEngineering та DataScience в ентерпр...Lviv Startup Club
 
Michael Vidyakin: Defining PMO Structure and Governance (UA)
Michael Vidyakin: Defining PMO Structure and Governance (UA)Michael Vidyakin: Defining PMO Structure and Governance (UA)
Michael Vidyakin: Defining PMO Structure and Governance (UA)Lviv Startup Club
 
Michael Vidyakin: Assessing Organizational Readiness (UA)
Michael Vidyakin: Assessing Organizational Readiness (UA)Michael Vidyakin: Assessing Organizational Readiness (UA)
Michael Vidyakin: Assessing Organizational Readiness (UA)Lviv Startup Club
 
Michael Vidyakin: Introduction to PMO (UA)
Michael Vidyakin: Introduction to PMO (UA)Michael Vidyakin: Introduction to PMO (UA)
Michael Vidyakin: Introduction to PMO (UA)Lviv Startup Club
 

Mais de Lviv Startup Club (20)

Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...
Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...
Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...
 
Mykhailo Hryhorash: What can be good in a "bad" project? (UA)
Mykhailo Hryhorash: What can be good in a "bad" project? (UA)Mykhailo Hryhorash: What can be good in a "bad" project? (UA)
Mykhailo Hryhorash: What can be good in a "bad" project? (UA)
 
Nikita Zahurdaiev: PMO Tools and Technologies (UA)
Nikita Zahurdaiev: PMO Tools and Technologies (UA)Nikita Zahurdaiev: PMO Tools and Technologies (UA)
Nikita Zahurdaiev: PMO Tools and Technologies (UA)
 
Nikita Zahurdaiev: Developing PMO Services and Functions (UA)
Nikita Zahurdaiev: Developing PMO Services and Functions (UA)Nikita Zahurdaiev: Developing PMO Services and Functions (UA)
Nikita Zahurdaiev: Developing PMO Services and Functions (UA)
 
Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)
Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)
Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)
 
Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)
Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)
Oleksandr Krakovetskyi: What's wrong with Generative AI? (UA)
 
Stanislav Podyachev: AI Agents as Role-Playing Business Modeling Tools (UA)
Stanislav Podyachev: AI Agents as Role-Playing Business Modeling Tools (UA)Stanislav Podyachev: AI Agents as Role-Playing Business Modeling Tools (UA)
Stanislav Podyachev: AI Agents as Role-Playing Business Modeling Tools (UA)
 
Kyryl Truskovskyi: Training and Serving Open-Sourced Foundational Models (UA)
Kyryl Truskovskyi: Training and Serving Open-Sourced Foundational Models (UA)Kyryl Truskovskyi: Training and Serving Open-Sourced Foundational Models (UA)
Kyryl Truskovskyi: Training and Serving Open-Sourced Foundational Models (UA)
 
Andrii Rodionov: What can go wrong in a distributed system – experience from ...
Andrii Rodionov: What can go wrong in a distributed system – experience from ...Andrii Rodionov: What can go wrong in a distributed system – experience from ...
Andrii Rodionov: What can go wrong in a distributed system – experience from ...
 
Dmytro Tkachenko: Можливості АІ відео для бізнесу (UA)
Dmytro Tkachenko: Можливості АІ відео для бізнесу (UA)Dmytro Tkachenko: Можливості АІ відео для бізнесу (UA)
Dmytro Tkachenko: Можливості АІ відео для бізнесу (UA)
 
Roman Kyslyi: Використання та побудова LLM агентів (UA)
Roman Kyslyi: Використання та побудова LLM агентів (UA)Roman Kyslyi: Використання та побудова LLM агентів (UA)
Roman Kyslyi: Використання та побудова LLM агентів (UA)
 
Veronika Snizhko: Штучний інтелект як каталізатор інноваційної культури в ком...
Veronika Snizhko: Штучний інтелект як каталізатор інноваційної культури в ком...Veronika Snizhko: Штучний інтелект як каталізатор інноваційної культури в ком...
Veronika Snizhko: Штучний інтелект як каталізатор інноваційної культури в ком...
 
Volodymyr Zhukov: Ключові труднощі в реальних імплементаціях AI. Досвід з пра...
Volodymyr Zhukov: Ключові труднощі в реальних імплементаціях AI. Досвід з пра...Volodymyr Zhukov: Ключові труднощі в реальних імплементаціях AI. Досвід з пра...
Volodymyr Zhukov: Ключові труднощі в реальних імплементаціях AI. Досвід з пра...
 
Volodymyr Zhukov: Куди рухається ринок AI у 2024 році. Інсайти від Stanford H...
Volodymyr Zhukov: Куди рухається ринок AI у 2024 році. Інсайти від Stanford H...Volodymyr Zhukov: Куди рухається ринок AI у 2024 році. Інсайти від Stanford H...
Volodymyr Zhukov: Куди рухається ринок AI у 2024 році. Інсайти від Stanford H...
 
Andrii Boichuk: The RAG is dead, long live the RAG або як сучасні LLM змінюют...
Andrii Boichuk: The RAG is dead, long live the RAG або як сучасні LLM змінюют...Andrii Boichuk: The RAG is dead, long live the RAG або як сучасні LLM змінюют...
Andrii Boichuk: The RAG is dead, long live the RAG або як сучасні LLM змінюют...
 
Vladyslav Fliahin: Applications of Gen AI in CV (UA)
Vladyslav Fliahin: Applications of Gen AI in CV (UA)Vladyslav Fliahin: Applications of Gen AI in CV (UA)
Vladyslav Fliahin: Applications of Gen AI in CV (UA)
 
Artem Ternov: Побудова платформи під DataEngineering та DataScience в ентерпр...
Artem Ternov: Побудова платформи під DataEngineering та DataScience в ентерпр...Artem Ternov: Побудова платформи під DataEngineering та DataScience в ентерпр...
Artem Ternov: Побудова платформи під DataEngineering та DataScience в ентерпр...
 
Michael Vidyakin: Defining PMO Structure and Governance (UA)
Michael Vidyakin: Defining PMO Structure and Governance (UA)Michael Vidyakin: Defining PMO Structure and Governance (UA)
Michael Vidyakin: Defining PMO Structure and Governance (UA)
 
Michael Vidyakin: Assessing Organizational Readiness (UA)
Michael Vidyakin: Assessing Organizational Readiness (UA)Michael Vidyakin: Assessing Organizational Readiness (UA)
Michael Vidyakin: Assessing Organizational Readiness (UA)
 
Michael Vidyakin: Introduction to PMO (UA)
Michael Vidyakin: Introduction to PMO (UA)Michael Vidyakin: Introduction to PMO (UA)
Michael Vidyakin: Introduction to PMO (UA)
 

Último

Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxRosabel UA
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxJanEmmanBrigoli
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Dust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSEDust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSEaurabinda banchhor
 

Último (20)

Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptx
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptx
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Dust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSEDust Of Snow By Robert Frost Class-X English CBSE
Dust Of Snow By Robert Frost Class-X English CBSE
 

Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017

• 8. Security (continued)
A full AAA (Authentication, Authorization, Auditing) security model with active auditing is very time-consuming to implement and might need commercial support.
- Authentication needs Kerberos.
- Authorization needs Sentry or Ranger.
- Auditing can be passive (good for a start) or active.
• 9. Infrastructure
It is usually better to have more, smaller servers than fewer, stronger ones - we want to parallelize the computations. You might still need strong boxes (with a great amount of RAM) for services like Impala or Spark that do in-memory processing.
It is always a good idea to have 4x 1 Gbps network ports as an LACP bond, if your LAN switches support this. It is also a good idea to plan HDFS tiering or to send “cold data” to cheap cloud storage.
Put worker nodes (HDFS DataNode + YARN NodeManager) on bare-metal machines, though virtualizing the masters is something to consider.
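For illustration only, a minimal sketch of such an LACP (802.3ad) bond on a RHEL/CentOS worker node in the classic network-scripts style; the device name, IP address and slave interfaces are assumptions, and the switch ports must be configured for LACP as well:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"
IPADDR=10.0.0.10
PREFIX=24
BOOTPROTO=none
ONBOOT=yes
# each slave (eth0..eth3) gets MASTER=bond0 and SLAVE=yes in its own ifcfg file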
• 11. Data Lake environments
Have a minimum of 2 environments (RC and Prod); 3 is best. An environment is a Hadoop cluster + BI tools + ETL jobs and any other services you will change or implement in Production.
1. Testing (R&D) - you can test new services and new functionalities there. It can be shut down at any time. Users can have their own environments for their own tests. It might be totally virtualized.
2. Release Candidate (RC) - has the same configuration as Production, but with minimal hardware resources (can be virtualized). Used for testing software upgrades and configuration changes. For example, when you implement Kerberos authentication this is a “must have” environment. Only selected users who need to prepare a production change have access.
3. Production - business users must obey the rules in order to keep the Data Lake services’ SLAs.
• 12. Storage file formats
1. Row-oriented (easily human readable, slow)
a. Used in RDBMSes for single-record (CRUD-style) queries that work on many columns at a time
b. Not so good for aggregations, because you need to do a full table scan
c. Not so good for compression, because neighboring values within one row have various data types

1,1,search,122.303.444,t_0,/products/search/filters=tshirts,US; -- row 1
2,1,view,122.303.444,t_1,/products/id=11212,US; -- row 2
...;
7,9,checkout,121.322.432,t_2,/checkout/cart=478/status=confirmed,UK; -- row 7

2. Columnar (very hard for a human to read, fast)
a. Values from a particular column are stored as neighbors of each other.
b. Very good for aggregations, because you only fetch the blocks of the needed columns, without a full table scan.
c. Very good for compression, because values of the same type sit next to each other (e.g. run-length encoding for integers, dictionary encoding for strings).

1,2,3,4,5,6,7;
1,3,9;
search,view,add_to_cart,checkout;
...;
US,FR,UK;
• 13. Storage file formats - TEXTFILE
TEXTFILE is a row format. Very good for a start. Good for storing CSV/TSV.

CREATE EXTERNAL TABLE `mydb.peoples_transactions` (
  `id` INT COMMENT 'Personal id',
  `name` STRING COMMENT 'Name of a person',
  `day` STRING COMMENT 'Transaction date'
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/externals/mydb/peoples_transactions';
• 14. Storage file formats - Avro
- Self-describing - the schema can be part of the data file.
- Supports column aliases for schema evolution.
- The schema can be referenced from a file (SERDEPROPERTIES) or embedded in the table definition (TBLPROPERTIES).
- Easy transformation from/to JSON.

Schema referenced from a file:

CREATE TEMPORARY EXTERNAL TABLE temporary_myavro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='/avro_schemas/myavro.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/externals/myavro';

Schema embedded in the table definition:

TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "peoples_transactions",
  "type": "record",
  "doc": "Table with transactions",
  "fields": [
    { "name": "id",   "type": "int",    "doc": "Personal id" },
    { "name": "name", "type": "string", "doc": "Name of a person" },
    { "name": "day",  "type": "string", "doc": "Transaction date" }
  ]
}')
• 15. Storage file formats - Parquet
Parquet is the most widely used columnar file format. Supported by Cloudera’s Impala in-memory engine, Hive and Spark. Has basic statistics - the number of elements in a column stored in a particular row group. Very good for Spark when using Tungsten.
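As a minimal sketch (table name, columns and path are assumptions, not from the slides), a DataFrame can be written as Parquet from Spark and then exposed to Hive and Impala as an external table:

myDF.write.format("parquet").save("/externals/mydb/mytable_parquet")

CREATE EXTERNAL TABLE mydb.mytable_parquet (
  `id` INT,
  `name` STRING,
  `day` STRING
)
STORED AS PARQUET
LOCATION '/externals/mydb/mytable_parquet';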
• 16. Storage file formats - ORC
- Has 3 index levels (file, stripe and 10k rows).
- Files can be even 78% smaller.
- Basic statistics: min, max, sum, count per column, per stripe and per file.
- When inserting into a table, try to sort the data by the most frequently queried column.
- Supports predicate pushdown.
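For example, a minimal sketch of creating an ORC table and loading one partition sorted by its most frequently filtered column (table, column and codec choices are illustrative):

CREATE TABLE mydb.mytable_orc (
  `id` INT,
  `name` STRING
)
PARTITIONED BY (`day` STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- sorting by `name` keeps the per-stripe min/max statistics tight,
-- which is what makes predicate pushdown on `name` effective
INSERT OVERWRITE TABLE mydb.mytable_orc PARTITION(`day`='2017-04-29')
SELECT `id`, `name` FROM mydb.mytable_external WHERE `day`='2017-04-29'
SORT BY `name`;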
• 17. Storage file formats - orcfiledump

hive --service orcfiledump /user/hive/warehouse/mydb.db/mytable/day=2017-04-29/000000_0
…
Column 115: count: 5000 hasNull: false min: max: 0 sum: 2128
Column 116: count: 5000 hasNull: false min: max: 992 sum: 7811
Stripe 10:
  Column 0: count: 5000 hasNull: false
  Column 1: count: 0 hasNull: true sum: 0
…
File Statistics:
  Column 0: count: 545034 hasNull: false
  Column 1: count: 0 hasNull: true sum: 0
…
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
…
Encoding column 115: DICTIONARY_V2[1]
Encoding column 116: DICTIONARY_V2[1]
• 18. Storage file formats - example
ORC write and load into a DataFrame:

import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}
sc.parallelize(records).toDF().write.format("orc").save("people")

val people = sqlContext.read.format("orc").load("people")
people.registerTempTable("people")

Predicate pushdown example:

sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sqlContext.sql("SELECT name FROM people WHERE age < 15").count()
• 19. Storage file formats - compression
None - fast data access, but large file sizes and large network bandwidth.
Snappy - written by Google. Very fast, but a plain Snappy-compressed file is not splittable - decoding will be done on one CPU core (inside block-based containers like ORC or Parquet it is splittable). Compression ratio about 40%. Usually the best choice.
Gzip - good compression ratio. Slower than Snappy. High CPU usage.
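A minimal sketch of setting the output codec per Hive session, assuming MapReduce execution (these are standard Hadoop/Hive properties; for ORC tables the codec is chosen per table via TBLPROPERTIES ('orc.compress'='SNAPPY') instead):

-- compress final query output with Snappy
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- or trade speed for size with Gzip
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;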
• 20. Batch ingestion
There are various methods to do batch ingestion, depending on the source of the data. Sqoop is used mostly to import from RDBMSes. Direct file upload to Hive is also possible. There are also other tools, like NiFi, to monitor and orchestrate the ingestion.
• 21. Batch ingestion - Sqoop
- Very good for copying tables from an RDBMS into HDFS files or Hive tables.
- The number of output files can be steered by the number of mappers used.
- Connects using JDBC drivers for the particular database, but not only.
- By default stores data as text files; SequenceFiles and Avro are also supported.
- Supports HCatalog - useful for importing into other storage formats.
- Supports incremental import based on last-value (append and lastmodified modes).
- You can specify a query to be imported.
- Supports Gzip compression (not enabled by default) and other codecs.
- Care must be taken with the many corner cases when importing, for example:
  - Fields in the database can contain newline characters; this is a problem when importing into Hive, where the default row delimiter is also '\n'.
  - NULLs by default are imported as the string 'null', while Hive uses '\N' (use --null-string for string columns and --null-non-string for non-string columns, with escaping).
• 22. Batch ingestion - Sqoop into ORC

sqoop import \
  -Dmapreduce.job.queuename=your.yarn.scheduler.queuename \
  --connect jdbc:mysql://myserver:3306/mydatabase \
  --username srvuser \
  --password-file file:///path/to/.sqoop.pw.file \
  -m 1 \
  --null-string '\\N' --null-non-string '\\N' \
  --table mytable \
  --hcatalog-database myhivedb \
  --hcatalog-table mytable \
  --hcatalog-partition-values "`date +%F`" \
  --hcatalog-partition-keys 'day' \
  --hcatalog-storage-stanza 'STORED AS ORCFILE'
• 23. Batch ingestion - Hive
1. Import into an external table
a. Copy files into HDFS, into the external table’s location directory, using hdfs dfs -put.
b. Partitioning is a good practice, usually by date, for example: /externals/myparttab/y=2017/m=04/d=29/
After uploading a file into a new partition you need to create this partition in the Hive metastore with the MSCK REPAIR TABLE command (see the shell sketch after this slide).
2. Import into optimized storage

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE mydb.mytable PARTITION(`day`)
SELECT `id`, `name`, `day` FROM mydb.mytable_external;
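A minimal shell sketch of step 1, assuming the partitioned table above (the file name, Hive URL and table name are illustrative):

hdfs dfs -mkdir -p /externals/myparttab/y=2017/m=04/d=29
hdfs dfs -put transactions.csv /externals/myparttab/y=2017/m=04/d=29/
beeline -u "jdbc:hive2://my.hive.com:10000/" -e "MSCK REPAIR TABLE mydb.myparttab;"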
• 24. Batch ingestion - NiFi
A powerful tool for controlling and monitoring data flows. In the GUI you build a graph of configurable processors and their relationships. You can change data formats on the fly (e.g. JSON into Avro). Has many processors for HDFS, Hive, Kafka, Flume, JDBC interfaces and many others. Can be used both for batch (interval- or cron-based) and for stream processing.
• 25. Streaming ingestion
1. Stream source (e.g. a farm of web servers or connected devices)
2. Message broker (e.g. Kafka)
3. Stream transport component (Flume, NiFi)
4. Stream processing engine (e.g. Spark, Storm, Flink)
5. Message format (JSON, Avro, text, compression)

[Diagram: a typical streaming ingestion configuration - the data source feeds, round-robin, two Flume agents whose sources write into Kafka channels backed by a Kafka cluster (Partition1/ReplicaP1, Partition2/ReplicaP2); a third Flume agent with a Kafka channel and an HDFS sink lands the data in HDFS under a Hive external table, from where Spark, Flink, Storm etc. can process it.]
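As a minimal sketch of the last hop in that configuration - a Flume agent that drains a Kafka topic straight into HDFS (broker, topic and path names are assumptions, not from the slides; a Kafka channel needs no separate source):

# flume-agent.conf
agent1.channels = kc1
agent1.sinks = hdfs1
agent1.channels.kc1.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.kc1.kafka.bootstrap.servers = broker1:9092,broker2:9092
agent1.channels.kc1.kafka.topic = clickstream
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.channel = kc1
agent1.sinks.hdfs1.hdfs.path = /externals/clickstream/day=%Y-%m-%d
agent1.sinks.hdfs1.hdfs.fileType = DataStream
agent1.sinks.hdfs1.hdfs.useLocalTimeStamp = true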
• 26. Data export
Hive export to comma/tab/delimiter-separated values formats:

CONNSTRING="jdbc:hive2://my.hive.com:10000/;principal=hive/my.hive.com@MYREALM.COM?mapreduce.job.queuename=my.queue"
DAY=`date +%F`
/usr/bin/beeline -u ${CONNSTRING} --outputformat=tsv2 --showHeader=false --hivevar DAY=$DAY \
  -e "SELECT * FROM mydb.mytable WHERE day='${hivevar:DAY}'" > mytable.${DAY}.tsv

For DSV: --delimiterForDSV='ł'
• 27. Data export - Spark
Complex exports can be run as a Spark or MapReduce application. For example, the easiest way to export to an RDBMS from Spark is a direct write from a DataFrame:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val prop = new java.util.Properties
val jurl = "jdbc:sqlserver://my.sql.com:1433;databaseName=mydb"
val rdbmsTab = "mytab"

def main(args: Array[String]): Unit = {
  prop.setProperty("user", "myuser")
  prop.setProperty("password", "XXX")
  prop.setProperty("driver", "rdbms.jdbc.driver")  // JDBC driver class for your RDBMS
  val sc = new SparkContext()
  val sqlContext = new HiveContext(sc)
  val myDF = sqlContext.sql("""
    SELECT country, count(id), day
    FROM mydb.mytab
    WHERE day < from_unixtime(unix_timestamp(),'yyyy-MM-dd')
    GROUP BY day, country
  """)
  myDF.write.mode("overwrite").jdbc(jurl, rdbmsTab, prop)
}

If you need to make UPDATEs you have to fall back to the plain JDBC DriverManager, because the DataFrame writer only supports the "error", "append", "overwrite" and "ignore" save modes.
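A minimal sketch of that JDBC fallback, assuming the same DataFrame and connection details as above (the UPDATE statement and column positions are illustrative):

import java.sql.DriverManager

myDF.foreachPartition { rows =>
  // one connection per partition, statements sent as a batch
  val conn = DriverManager.getConnection(jurl, "myuser", "XXX")
  val stmt = conn.prepareStatement("UPDATE mytab SET cnt = ? WHERE country = ? AND day = ?")
  rows.foreach { r =>
    stmt.setLong(1, r.getLong(1))      // count(id)
    stmt.setString(2, r.getString(0))  // country
    stmt.setString(3, r.getString(2))  // day
    stmt.addBatch()
  }
  stmt.executeBatch()
  conn.close()
}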
• 28. Machine Learning model lifecycle
Lifecycle based on Spark:
1. A Data Scientist trains the model and saves it as PMML or as a Spark ML Pipeline (see the sketch after this slide).
2. Depending on the need, the ML model can be used by a Data Engineer to:
a. Recalculate data once a day and export it, for example to an RDBMS.
b. Load the saved model and expose it, for example via a REST API, or update an in-memory store like Druid.
c. Use Oryx for online model upgrades.
3. Data Scientists must have access to measure the model’s effectiveness.
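A minimal sketch of the Spark ML Pipeline variant (requires Spark 1.6+; the model variable is assumed to be a trained PipelineModel and the path is illustrative):

import org.apache.spark.ml.PipelineModel

// the Data Scientist persists the trained pipeline to HDFS...
model.write.overwrite().save("/models/mymodel_v1")

// ...and a Data Engineer's batch job reloads and applies it (step 2a/2b)
val reloaded = PipelineModel.load("/models/mymodel_v1")
val scored = reloaded.transform(newData)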
• 30. Spark considerations
- DataFrames are better than RDDs (an RDD is a collection of Java objects - don’t cache it, because of the serialization overhead).
- When using Spark Streaming it is better to log an error once per batch than to throw it 100k times.
- Kafka’s sweet spot for message size is 10k-100k. Even packing 2 rows into one message is better than one row per message.
• 31. Environments for DA and DS
- Hue - the primary tool for Data Analysts and Data Scientists.
- Beeline - for accessing Hive from the CLI.
- JDBC - for connecting to Hive tables from Excel.
- Self-service BI tools (Tableau, Qlik, etc.).
- Jupyter - a notebook for Data Scientists. Can run Scala/Python/R in one flow.
• 32. What’s next
- Data Lake 3.0 (Hortonworks)
  - Application assembly - run multiple services in Dockerized containers on YARN. Each can have its own environment.
  - Auto-tiering - for automatic data movement between tiers.
  - Network- and IO-level isolation.
- Cloudera Data Science Workbench
  - Collaborative platform for Data Scientists.
  - Generally available since June 2017.
- Spark 2 (see the sketch below)
  - SparkContext and HiveContext are unified into SparkSession.
  - CSV support is built in; in Spark 1.x you had to add the spark-csv library from Databricks manually.
  - Global temporary views are visible to other sessions.
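A minimal sketch of those Spark 2 changes (file path and view name are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("example")
  .enableHiveSupport()   // replaces the separate HiveContext
  .getOrCreate()

// CSV is a built-in source in Spark 2 - no external spark-csv package needed
val df = spark.read.option("header", "true").csv("/externals/mydb/mydata.csv")

// global temporary views live in the global_temp database and are visible to other sessions
df.createGlobalTempView("myview")
spark.sql("SELECT count(*) FROM global_temp.myview").show()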