Page1
Mutable Data in Hive’s Immutable World
Lester Martin – Hortonworks 2015 Hadoop Summit
Page2
Connection before Content
Lester Martin – Hortonworks Professional Services
lmartin@hortonworks.com || lester.martin@gmail.com
http://lester.website (links to blog, twitter,
github, LI, FB, etc)
Page3
“Traditional” Hadoop Data
Time-Series Immutable (TSI) Data – Hive’s sweet spot
Going beyond web logs to more exotic data such as:
Vehicle sensors (ground, air, above/below water – space!)
Patient data (to include the atmosphere around them)
Smart phone/watch (TONS of info)
[Slide graphic – SOURCES: Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured]
Page4
Good TSI Solutions Exist
Hive partitions
•Store as much as you want
•Only read the files you need
Hive Streaming Data Ingest from Flume or Storm
Sqoop's --incremental mode of append
•Use appropriate --check-column
•"Saved Job" remembering --last-value
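For example, a saved Sqoop job for this append pattern might look like the sketch below (the connection string, table, and column names are hypothetical):

sqoop job --create sensor_append -- import \
  --connect jdbc:mysql://dbhost/appdb --username etl -P \
  --table sensor_readings \
  --target-dir /data/raw/sensor_readings \
  --incremental append \
  --check-column reading_id \
  --last-value 0
# later runs re-use the job's stored --last-value
sqoop job --exec sensor_append

Each --exec run imports only rows whose check column is greater than the stored --last-value.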
Page5
Use Case for an Active Archive
Evolving Domain Data – Hive likes immutable data
Need exact copy of mutating tables refreshed periodically
•Structural replica of multiple RDBMS tables
•The data in these tables are being updated
•Don’t need every change; just “as of” content
[Slide graphic – SOURCES (Existing Systems): ERP, CRM, SCM, eComm]
Page6
Start With a Full Refresh Strategy
The epitome of the KISS principle
•Ingest & load new data
•Drop the existing table
•Rename the newly created table
Certainly not elegant, but it solves the problem until the reload
takes longer than the refresh period
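A minimal HiveQL sketch of that swap, assuming the refreshed copy was first loaded into a hypothetical bogus_info_new table:

-- drop the stale copy and promote the freshly loaded one
DROP TABLE IF EXISTS bogus_info;
ALTER TABLE bogus_info_new RENAME TO bogus_info;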
Page7
Then Evolve to a Merge & Replace Strategy
Typically, deltas are…
•Small % of existing data
•Plus, some totally new records
In practice, the difference in the sizes of the two circles
(existing data vs. deltas) is often much more pronounced
Page8
Requirements for Merge & Replace
An immutable unique key
•To determine if an addition or a change
•The source table’s (natural or surrogate) PK is perfect
A last-updated timestamp to find the deltas
Leverage Sqoop's --incremental mode of
lastmodified to identify the deltas
•Use appropriate --check-column
•"Saved Job" remembering --last-value
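An indicative saved job for this mode (connection details and the last_updated column are hypothetical):

sqoop job --create bogus_deltas -- import \
  --connect jdbc:mysql://dbhost/appdb --username etl -P \
  --table bogus_info \
  --target-dir /user/fred/delta_staging \
  --incremental lastmodified \
  --check-column last_updated
# pulls only rows modified since the stored --last-value
sqoop job --exec bogus_deltas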
Page9
Processing Steps for Merge & Replace
See blog at http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/,
but note that merge can be done in multiple technologies, not just Hive
Ingest – bring over the incremental
data
Reconcile – perform the merge
Compact – replace the existing data
with the newly merged content
Purge – cleanup & prepare to repeat
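As one hedged sketch of the Reconcile step in HiveQL (base_table, incremental_table, and last_updated are placeholder names, and this is only one way to express the merge), a view can keep just the newest version of each key:

CREATE VIEW reconcile_view AS
SELECT bogus_id, date_created, field_one, field_two, field_three
FROM (
  SELECT bogus_id, date_created, field_one, field_two, field_three,
         ROW_NUMBER() OVER (PARTITION BY bogus_id
                            ORDER BY last_updated DESC) AS rn
  FROM (SELECT * FROM base_table
        UNION ALL
        SELECT * FROM incremental_table) u
) ranked
WHERE rn = 1;   -- latest version of each key wins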
Page10
Full Merge & Replace Will NOT Scale
The “elephant” eventually gets too big
and merging it with the “mouse” takes
too long!
Example: A Hive structure with 100
billion rows, but only 100,000 delta
records
Page11
What Will? The Classic Hadoop Strategy!
Page12
But… One Size Does NOT Fit All…
Not everything is “big” – in fact, most operational apps’
tables are NOT too big for a simple Full Refresh
Divide & Conquer requires additional per-table research
to decide on the best partitioning strategy
Page13
Criteria for Active Archive Partition Values
Non-nullable & immutable
Ensures sliding scale growth with new records generally
creating new partitions
Supports delta records being skewed such that the
percentage of partitions needing merge & replace
operations is relatively small
Classic value is (still) "Date Created"
Page14
Work on (FEW!) Partitions in Parallel
Page15
Partition-Level Merge & Replace Steps
Generate the delta file
Create list of affected partitions
Perform merge & replace operations for affected partitions
1. Filter the delta file for the current partition
2. Load the Hive table’s current partition
3. Merge the two datasets
4. Delete the existing partition
5. Recreate the partition with the merged content
Page16
What Does This Approach Look Like?
A Lightning-Fast Review of an Indicative Hybrid Pig-Hive Example
Page17
One-Time: Create the Table
CREATE TABLE bogus_info(
bogus_id int,
field_one string,
field_two string,
field_three string)
PARTITIONED BY (date_created STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");
Page18
One-Time: Get Content from the Source
11,2014-09-17,base,base,base
12,2014-09-17,base,base,base
13,2014-09-17,base,base,base
14,2014-09-18,base,base,base
15,2014-09-18,base,base,base
16,2014-09-18,base,base,base
17,2014-09-19,base,base,base
18,2014-09-19,base,base,base
19,2014-09-19,base,base,base
Page19
One-Time: Read Content from HDFS
as_recd = LOAD '/user/fred/original.txt'
USING PigStorage(',') AS
(
bogus_id:int,
date_created:chararray,
field_one:chararray,
field_two:chararray,
field_three:chararray
);
Page20
One-Time: Sort and Insert into Hive Table
sorted_as_recd = ORDER as_recd BY
date_created, bogus_id;
STORE sorted_as_recd INTO 'bogus_info'
USING
org.apache.hcatalog.pig.HCatStorer();
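Note that Pig scripts using HCatStorer/HCatLoader are normally launched with HCatalog support enabled, e.g. (the script name is a placeholder):

pig -useHCatalog load_bogus_info.pig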
Page21
One-Time: Verify Data are Present
hive> select * from bogus_info;
11 base base base 2014-09-17
12 base base base 2014-09-17
13 base base base 2014-09-17
14 base base base 2014-09-18
15 base base base 2014-09-18
16 base base base 2014-09-18
17 base base base 2014-09-19
18 base base base 2014-09-19
19 base base base 2014-09-19
Page22
One-Time: Verify Partitions are Present
hdfs dfs -ls /apps/hive/warehouse/bogus_info
Found 3 items
…
/apps/hive/warehouse/bogus_info/date_created=2014-09-17
…
/apps/hive/warehouse/bogus_info/date_created=2014-09-18
…
/apps/hive/warehouse/bogus_info/date_created=2014-09-19
Page23
Generate the Delta File
20,2014-09-20,base,base,base
21,2014-09-20,base,base,base
22,2014-09-20,base,base,base
12,2014-09-17,base,CHANGED,base
14,2014-09-18,base,CHANGED,base
16,2014-09-18,base,CHANGED,base
Page24
Read Delta File from HDFS
delta_recd = LOAD '/user/fred/delta1.txt'
USING PigStorage(',') AS
(
bogus_id:int,
date_created:chararray,
field_one:chararray,
field_two:chararray,
field_three:chararray
);
Page25
Create List of Affected Partitions
by_grp = GROUP delta_recd BY date_created;
part_names = FOREACH by_grp GENERATE group;
srtd_part_names = ORDER part_names BY group;
STORE srtd_part_names INTO
'/user/fred/affected_partitions';
Page26
Loop/Multithread Through Affected Partitions
Pig doesn’t really help you with this problem
This indicative example could be implemented as:
•A simple script that loops through the partitions
•A Java program that multi-threads the partition-aligned processing
Multiple “Control Structures” options exist as described at
http://pig.apache.org/docs/r0.14.0/cont.html
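For instance, the simple-script option could be little more than a shell loop that feeds each partition key to a parameterized Pig script (merge_partition.pig is a hypothetical name for the loop-step script shown on the next slides):

hdfs dfs -cat /user/fred/affected_partitions/part-r-* |
while read partition_key; do
  # run the merge & replace loop steps for one partition at a time
  pig -useHCatalog -param partition_key="$partition_key" merge_partition.pig
done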
Page27
Loop Step: Filter on the Current Partition
delta_recd = LOAD '/user/fred/delta1.txt'
USING PigStorage(',') AS
( bogus_id:int, date_created:chararray,
field_one:chararray,
field_two:chararray,
field_three:chararray );
deltaP = FILTER delta_recd BY date_created
== '$partition_key';
Page28
Loop Step: Retrieve Hive’s Current Partition
all_bogus_info = LOAD 'bogus_info' USING
org.apache.hcatalog.pig.HCatLoader();
tblP = FILTER all_bogus_info
BY date_created == '$partition_key';
Page29
Loop Step: Merge the Datasets
partJ = JOIN tblP BY bogus_id FULL OUTER,
deltaP BY bogus_id;
combined_part = FOREACH partJ GENERATE
((deltaP::bogus_id is not null) ?
deltaP::bogus_id : tblP::bogus_id) as
bogus_id, /* do for all fields
and end with ";" */
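Expanded for every column of bogus_info, the same bincond pattern reads as the sketch below:

combined_part = FOREACH partJ GENERATE
  ((deltaP::bogus_id is not null) ? deltaP::bogus_id : tblP::bogus_id) as bogus_id,
  ((deltaP::bogus_id is not null) ? deltaP::date_created : tblP::date_created) as date_created,
  ((deltaP::bogus_id is not null) ? deltaP::field_one : tblP::field_one) as field_one,
  ((deltaP::bogus_id is not null) ? deltaP::field_two : tblP::field_two) as field_two,
  ((deltaP::bogus_id is not null) ? deltaP::field_three : tblP::field_three) as field_three;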
Page30
Loop Step: Sort and Save the Merged Data
s_combined_part = ORDER combined_part BY
date_created, bogus_id;
STORE s_combined_part INTO
'/user/fred/temp_$partition_key' USING PigStorage(',');
hdfs dfs -cat temp_2014-09-17/part-r-00000
11,2014-09-17,base,base,base
12,2014-09-17,base,CHANGED,base
13,2014-09-17,base,base,base
Page31
Loop Step: Delete the Partition
ALTER TABLE bogus_info DROP IF EXISTS
PARTITION (date_created='2014-09-17');
Page32
Loop Step: Recreate the Partition
reload_recd = LOAD '/user/fred/temp_$partition_key'
USING PigStorage(',') AS
( bogus_id:int, date_created:chararray,
field_one:chararray,
field_two:chararray,
field_three:chararray );
STORE reload_recd INTO 'bogus_info' USING
org.apache.hcatalog.pig.HCatStorer();
Page33
Verify the Loop Step Updates
select * from bogus_info
where date_created = '2014-09-17';
11 base base base 2014-09-17
12 base CHANGED base 2014-09-17
13 base base base 2014-09-17
Page34
My Head Hurts, Too!
As Promised, We Flew Through That – Take Another Look Later
Page35
What Does Merge & Replace Miss?
Deletes – rows removed at the source never show up as deltas
If deletes are critical, you have options
•Create a delete table sourced by a trigger
•At some wide frequency, start all over with a Full Refresh
Fortunately, ~most~ enterprises
don’t delete anything
Marking items “inactive” is
popular
Page36
Hybrid: Partition-Level Refresh
If most of the partition is modified, just replace it entirely
Especially if the changes are only recent (or highly skewed)
Use a configured number of partitions to refresh and
assume the rest of the data is static
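A sketch of one partition's refresh in HiveQL, assuming the fresh extract has landed in a hypothetical staging_bogus_info table:

-- overwrite the entire partition with the latest source content
INSERT OVERWRITE TABLE bogus_info PARTITION (date_created = '2014-09-17')
SELECT bogus_id, field_one, field_two, field_three
FROM staging_bogus_info
WHERE date_created = '2014-09-17';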
Page37
Active Archive Strategy Review
strategy                          | # of rows   | % of chg | chg skew | handles deletes | complexity
Full Refresh                      | <= millions | any      | any      | yes             | simple
Full Merge & Replace              | <= millions | any      | any      | no              | moderate
Partition-Level Merge & Replace   | billions +  | < 5%     | < 5%     | no              | complex
Partition-Level Refresh           | billions +  | < 5%     | < 5%     | yes             | complex
Page38
Isn’t There Anything Easier?
HIVE-5317 brought us Insert, Update & Delete
•Alan Gates presented Monday
•More tightly-coupled w/o the same “hazard windows”
•“Driver” logic shifts to be delta-only & row-focused
Thoughts & attempts at true DB replication
•Some COTS solutions have been tried
•Ideally, an open-source alternative is best, such as enhancing the
Streaming Data Ingest framework
Page39
Considerations for HIVE-5317
On performance & scalability: your mileage may vary
Does NOT make Hive an RDBMS
Available in Hive 0.14 onwards
DDL requirements
•Must utilize partitioning & bucketing
•Initially, only supports ORC
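For illustration only (the _acid suffix and bucket count are arbitrary, and ACID must be enabled on the cluster), the DDL and a row-level change look roughly like:

CREATE TABLE bogus_info_acid (
  bogus_id int,
  field_one string,
  field_two string,
  field_three string)
PARTITIONED BY (date_created string)
CLUSTERED BY (bogus_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

UPDATE bogus_info_acid SET field_two = 'CHANGED' WHERE bogus_id = 12;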
Page40
Recommendations
Take another look at this topic once back at “your desk”
As with all things Hadoop…
•Know your data & workloads
•Try several approaches & evaluate results in earnest
•Stick with the KISS principle whenever possible
Share your findings via blogs and local user groups
Expect (even more!) great things from Hive
Page41
Questions?
Lester Martin – Hortonworks Professional Services
lmartin@hortonworks.com || lester.martin@gmail.com
http://lester.website (links to blog, twitter, github, LI, FB, etc)
THANKS FOR YOUR TIME!!
Mais conteúdo relacionado

Mais procurados

Presentation slides of Sequence Query Language (SQL)
Presentation slides of Sequence Query Language (SQL)Presentation slides of Sequence Query Language (SQL)
Presentation slides of Sequence Query Language (SQL)Punjab University
 
Troubleshooter´s Top 5: Real-world performance tuning of IFS Applications
Troubleshooter´s Top 5: Real-world performance tuning of IFS ApplicationsTroubleshooter´s Top 5: Real-world performance tuning of IFS Applications
Troubleshooter´s Top 5: Real-world performance tuning of IFS ApplicationsIFS
 
PostgreSQL Database Slides
PostgreSQL Database SlidesPostgreSQL Database Slides
PostgreSQL Database Slidesmetsarin
 
Types Of Join In Sql Server - Join With Example In Sql Server
Types Of Join In Sql Server - Join With Example In Sql ServerTypes Of Join In Sql Server - Join With Example In Sql Server
Types Of Join In Sql Server - Join With Example In Sql Serverprogrammings guru
 
Constraints In Sql
Constraints In SqlConstraints In Sql
Constraints In SqlAnurag
 
Nested Queries Lecture
Nested Queries LectureNested Queries Lecture
Nested Queries LectureFelipe Costa
 
Introduction VAUUM, Freezing, XID wraparound
Introduction VAUUM, Freezing, XID wraparoundIntroduction VAUUM, Freezing, XID wraparound
Introduction VAUUM, Freezing, XID wraparoundMasahiko Sawada
 
pl/sql Procedure
pl/sql Procedurepl/sql Procedure
pl/sql ProcedurePooja Dixit
 
PostgreSQL HA
PostgreSQL   HAPostgreSQL   HA
PostgreSQL HAharoonm
 
An Overview of Temporal Features in SQL:2011
An Overview of Temporal Features in SQL:2011An Overview of Temporal Features in SQL:2011
An Overview of Temporal Features in SQL:2011Craig Baumunk
 

Mais procurados (20)

Sql join
Sql  joinSql  join
Sql join
 
Get to know PostgreSQL!
Get to know PostgreSQL!Get to know PostgreSQL!
Get to know PostgreSQL!
 
Presentation slides of Sequence Query Language (SQL)
Presentation slides of Sequence Query Language (SQL)Presentation slides of Sequence Query Language (SQL)
Presentation slides of Sequence Query Language (SQL)
 
Sql joins
Sql joinsSql joins
Sql joins
 
Troubleshooter´s Top 5: Real-world performance tuning of IFS Applications
Troubleshooter´s Top 5: Real-world performance tuning of IFS ApplicationsTroubleshooter´s Top 5: Real-world performance tuning of IFS Applications
Troubleshooter´s Top 5: Real-world performance tuning of IFS Applications
 
PostgreSQL Database Slides
PostgreSQL Database SlidesPostgreSQL Database Slides
PostgreSQL Database Slides
 
Sql tutorial
Sql tutorialSql tutorial
Sql tutorial
 
Types Of Join In Sql Server - Join With Example In Sql Server
Types Of Join In Sql Server - Join With Example In Sql ServerTypes Of Join In Sql Server - Join With Example In Sql Server
Types Of Join In Sql Server - Join With Example In Sql Server
 
set operators.pptx
set operators.pptxset operators.pptx
set operators.pptx
 
Constraints In Sql
Constraints In SqlConstraints In Sql
Constraints In Sql
 
Nested Queries Lecture
Nested Queries LectureNested Queries Lecture
Nested Queries Lecture
 
MySQL JOINS
MySQL JOINSMySQL JOINS
MySQL JOINS
 
Introduction VAUUM, Freezing, XID wraparound
Introduction VAUUM, Freezing, XID wraparoundIntroduction VAUUM, Freezing, XID wraparound
Introduction VAUUM, Freezing, XID wraparound
 
SQL
SQLSQL
SQL
 
pl/sql Procedure
pl/sql Procedurepl/sql Procedure
pl/sql Procedure
 
PostgreSQL HA
PostgreSQL   HAPostgreSQL   HA
PostgreSQL HA
 
Sql commands
Sql commandsSql commands
Sql commands
 
DML, DDL, DCL ,DRL/DQL and TCL Statements in SQL with Examples
DML, DDL, DCL ,DRL/DQL and TCL Statements in SQL with ExamplesDML, DDL, DCL ,DRL/DQL and TCL Statements in SQL with Examples
DML, DDL, DCL ,DRL/DQL and TCL Statements in SQL with Examples
 
An Overview of Temporal Features in SQL:2011
An Overview of Temporal Features in SQL:2011An Overview of Temporal Features in SQL:2011
An Overview of Temporal Features in SQL:2011
 
MS SQL Server
MS SQL ServerMS SQL Server
MS SQL Server
 

Destaque

IoT: How Data Science Driven Software is Eating the Connected World
IoT: How Data Science Driven Software is Eating the Connected WorldIoT: How Data Science Driven Software is Eating the Connected World
IoT: How Data Science Driven Software is Eating the Connected WorldDataWorks Summit
 
The Future of Hadoop Security
The Future of Hadoop SecurityThe Future of Hadoop Security
The Future of Hadoop SecurityDataWorks Summit
 
하둡 타입과 포맷
하둡 타입과 포맷하둡 타입과 포맷
하둡 타입과 포맷진호 박
 
Dealing with Changed Data in Hadoop
Dealing with Changed Data in HadoopDealing with Changed Data in Hadoop
Dealing with Changed Data in HadoopDataWorks Summit
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...DataWorks Summit
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)DataWorks Summit
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopDataWorks Summit
 
Running Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionRunning Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionDataWorks Summit
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNDataWorks Summit
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenDataWorks Summit
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...DataWorks Summit
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentDataWorks Summit
 
Hadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeHadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeDataWorks Summit
 
Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets DataWorks Summit
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...DataWorks Summit
 
Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]DataWorks Summit
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLDataWorks Summit
 

Destaque (20)

IoT: How Data Science Driven Software is Eating the Connected World
IoT: How Data Science Driven Software is Eating the Connected WorldIoT: How Data Science Driven Software is Eating the Connected World
IoT: How Data Science Driven Software is Eating the Connected World
 
The Future of Hadoop Security
The Future of Hadoop SecurityThe Future of Hadoop Security
The Future of Hadoop Security
 
하둡 타입과 포맷
하둡 타입과 포맷하둡 타입과 포맷
하둡 타입과 포맷
 
Dealing with Changed Data in Hadoop
Dealing with Changed Data in HadoopDealing with Changed Data in Hadoop
Dealing with Changed Data in Hadoop
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
Running Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionRunning Spark and MapReduce together in Production
Running Spark and MapReduce together in Production
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP Haven
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
 
Hadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeHadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance Initiative
 
Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
 
Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 

Semelhante a Mutable Data in Hive's Immutable World

Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveHarnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveQubole
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudShubham Tagra
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...Alluxio, Inc.
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesJon Meredith
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.pptRutujaPatil247341
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it bettergvernik
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?gvernik
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook AhmedDoukh
 
JDD2014: Real Big Data - Scott MacGregor
JDD2014: Real Big Data - Scott MacGregorJDD2014: Real Big Data - Scott MacGregor
JDD2014: Real Big Data - Scott MacGregorPROIDEA
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudQubole
 
Making pig fly optimizing data processing on hadoop presentation
Making pig fly  optimizing data processing on hadoop presentationMaking pig fly  optimizing data processing on hadoop presentation
Making pig fly optimizing data processing on hadoop presentationMd Rasool
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Alluxio, Inc.
 
SAP HANA SPS10- Enterprise Information Management
SAP HANA SPS10- Enterprise Information ManagementSAP HANA SPS10- Enterprise Information Management
SAP HANA SPS10- Enterprise Information ManagementSAP Technology
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDataWorks Summit
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 

Semelhante a Mutable Data in Hive's Immutable World (20)

Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveHarnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook
 
JDD2014: Real Big Data - Scott MacGregor
JDD2014: Real Big Data - Scott MacGregorJDD2014: Real Big Data - Scott MacGregor
JDD2014: Real Big Data - Scott MacGregor
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Making pig fly optimizing data processing on hadoop presentation
Making pig fly  optimizing data processing on hadoop presentationMaking pig fly  optimizing data processing on hadoop presentation
Making pig fly optimizing data processing on hadoop presentation
 
Data Science
Data ScienceData Science
Data Science
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
SAP HANA SPS10- Enterprise Information Management
SAP HANA SPS10- Enterprise Information ManagementSAP HANA SPS10- Enterprise Information Management
SAP HANA SPS10- Enterprise Information Management
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Último (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Mutable Data in Hive's Immutable World

  • 1. Page1 Mutable Data in Hive’s Immutable World Lester Martin – Hortonworks 2015 Hadoop Summit
  • 2. Page2 Connection before Content Lester Martin – Hortonworks Professional Services lmartin@hortonworks.com || lester.martin@gmail.com http://lester.website (links to blog, twitter, github, LI, FB, etc)
  • 3. Page3 “Traditional” Hadoop Data Time-Series Immutable (TSI) Data – Hive’s sweet spot Going beyond web logs to more exotic data such as: Vehicle sensors (ground, air, above/below water – space!) Patient data (to include the atmosphere around them) Smart phone/watch (TONS of info) Clickstream Web & Social Geolocation Sensor & Machine Server Logs Unstructured SOURCES
  • 4. Page4 Good TSI Solutions Exist Hive partitions •Store as much as you want •Only read the files you need Hive Streaming Data Ingest from Flume or Storm Sqoop’s –-incremental mode of append •Use appropriate –-check-column •“Saved Job” remembering –last-value
  • 5. Page5 Use Case for an Active Archive Evolving Domain Data – Hive likes immutable data Need exact copy of mutating tables refreshed periodically •Structural replica of multiple RDBMS tables •The data in these tables are being updated •Don’t need every change; just “as of” content Existing Systems ERP CRM SCM SOURCES eComm
  • 6. Page6 Start With a Full Refresh Strategy The epitome of the KISS principle •Ingest & load new data •Drop the existing table •Rename the newly created table Surely not elegant, but solves the problem until the reload takes longer than the refresh period
  • 7. Page7 Then Evolve to a Merge & Replace Strategy Typically, deltas are… •Small % of existing data •Plus, some totally new records In practice, differences in sizes of circles is often much more pronounced
  • 8. Page8 Requirements for Merge & Replace An immutable unique key •To determine if an addition or a change •The source table’s (natural or surrogate) PK is perfect A last-updated timestamp to find the deltas Leverage Sqoop’s –-incremental mode of lastmodified to identify the deltas •Use appropriate –-check-column •“Saved Job” remembering –last-value
  • 9. Page9 Processing Steps for Merge & Replace See blog at http://hortonworks.com/blog/four-step- strategy-incremental-updates-hive/, but note that merge can be done in multiple technologies, not just Hive Ingest – bring over the incremental data Reconcile – perform the merge Compact – replace the existing data with the newly merged content Purge – cleanup & prepare to repeat
  • 10. Page10 Full Merge & Replace Will NOT Scale The “elephant” eventually gets too big and merging it with the “mouse” takes too long! Example: A Hive structure with 100 billion rows, but only 100,000 delta records
  • 11. Page11 What Will? The Classic Hadoop Strategy!
  • 12. Page12 But… One Size Does NOT Fit All… Not everything is “big” – in fact, most operational apps’ tables are NOT too big for a simple Full Refresh Divide & Conquer requires additional per-table research to ensure the best partitioning strategy is decided upon
  • 13. Page13 Criteria for Active Archive Partition Values Non-nullable & immutable Ensures sliding scale growth with new records generally creating new partitions Supports delta records being skewed such that the percentage of partitions needing merge & replace operations is relatively small Classic value is (still) “Date Created”
  • 14. Page14 Work on (FEW!) Partitions in Parallel
  • 15. Page15 Partition-Level Merge & Replace Steps Generate the delta file Create list of affected partitions Perform merge & replace operations for affected partitions 1. Filter the delta file for the current partition 2. Load the Hive table’s current partition 3. Merge the two datasets 4. Delete the existing partition 5. Recreate the partition with the merged content
  • 16. Page16 What Does This Approach Look Like? A Lightning-Fast Review of an Indicative Hybrid Pig-Hive Example
  • 17. Page17 One-Time: Create the Table CREATE TABLE bogus_info( bogus_id int, field_one string, field_two string, field_three string) PARTITIONED BY (date_created STRING) STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");
  • 18. Page18 One-Time: Get Content from the Source 11,2014-09-17,base,base,base 12,2014-09-17,base,base,base 13,2014-09-17,base,base,base 14,2014-09-18,base,base,base 15,2014-09-18,base,base,base 16,2014-09-18,base,base,base 17,2014-09-19,base,base,base 18,2014-09-19,base,base,base 19,2014-09-19,base,base,base
  • 19. Page19 One-Time: Read Content from HDFS as_recd = LOAD '/user/fred/original.txt' USING PigStorage(',') AS ( bogus_id:int, date_created:chararray, field_one:chararray, field_two:chararray, field_three:chararray );
  • 20. Page20 One-Time: Sort and Insert into Hive Table sorted_as_recd = ORDER as_recd BY date_created, bogus_id; STORE sorted_as_recd INTO 'bogus_info' USING org.apache.hcatalog.pig.HCatStorer();
  • 21. Page21 One-Time: Verify Data are Present hive> select * from bogus_info; 11 base base base 2014-09-17 12 base base base 2014-09-17 13 base base base 2014-09-17 14 base base base 2014-09-18 15 base base base 2014-09-18 16 base base base 2014-09-18 17 base base base 2014-09-19 18 base base base 2014-09-19 19 base base base 2014-09-19
  • 22. Page22 One-Time: Verify Partitions are Present hdfs dfs -ls /apps/hive/warehouse/bogus_info Found 3 items … /apps/hive/warehouse/bogus_info/date_created=2014-09-17 … /apps/hive/warehouse/bogus_info/date_created=2014-09-18 … /apps/hive/warehouse/bogus_info/date_created=2014-09-19
  • 23. Page23 Generate the Delta File 20,2014-09-20,base,base,base 21,2014-09-20,base,base,base 22,2014-09-20,base,base,base 12,2014-09-17,base,CHANGED,base 14,2014-09-18,base,CHANGED,base 16,2014-09-18,base,CHANGED,base
  • 24. Page24 Read Delta File from HDFS delta_recd = LOAD '/user/fred/delta1.txt' USING PigStorage(',') AS ( bogus_id:int, date_created:chararray, field_one:chararray, field_two:chararray, field_three:chararray );
  • 25. Page25 Create List of Affected Partitions by_grp = GROUP delta_recd BY date_created; part_names = FOREACH by_grp GENERATE group; srtd_part_names = ORDER part_names BY group; STORE srtd_part_names INTO '/user/fred/affected_partitions';
  • 26. Page26 Loop/Multithread Through Affected Partitions Pig doesn’t really help you with this problem This indicative example could be implemented as: •A simple script that loops through the partitions •A Java program that multi-threads the partition-aligned processing Multiple “Control Structures” options exist as described at http://pig.apache.org/docs/r0.14.0/cont.html
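  One way the simple looping script could look (a sketch under assumed paths; merge_partition.pig and reload_partition.pig are hypothetical names standing in for the per-partition Pig logic shown on the following slides):

      # Read the affected-partition list produced in the previous step and
      # run the merge & replace pipeline once per partition key.
      hdfs dfs -cat /user/fred/affected_partitions/part-* | while read partition_key; do
        # Filter the delta, load the current partition, merge, and store to a temp dir
        pig -param partition_key="$partition_key" merge_partition.pig
        # Drop the stale partition from the Hive table
        hive -e "ALTER TABLE bogus_info DROP IF EXISTS PARTITION (date_created='$partition_key');"
        # Reload the merged content back into the table via HCatStorer
        pig -param partition_key="$partition_key" reload_partition.pig
      done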
  • 27. Page27 Loop Step: Filter on the Current Partition delta_recd = LOAD '/user/fred/delta1.txt' USING PigStorage(',') AS ( bogus_id:int, date_created:chararray, field_one:chararray, field_two:chararray, field_three:chararray ); deltaP = FILTER delta_recd BY date_created == '$partition_key';
  • 28. Page28 Loop Step: Retrieve Hive’s Current Partition all_bogus_info = LOAD 'bogus_info' USING org.apache.hcatalog.pig.HCatLoader(); tblP = FILTER all_bogus_info BY date_created == '$partition_key';
  • 29. Page29 Loop Step: Merge the Datasets partJ = JOIN tblP BY bogus_id FULL OUTER, deltaP BY bogus_id; combined_part = FOREACH partJ GENERATE ((deltaP::bogus_id is not null) ? deltaP::bogus_id: tblP::bogus_id) as bogus_id, /* do for all fields and end with ";" */
  • 30. Page30 Loop Step: Sort and Save the Merged Data s_combined_part = ORDER combined_part BY date_created, bogus_id; STORE s_combined_part INTO '/user/fred/temp_$partition_key' USING PigStorage(','); hdfs dfs -cat temp_2014-09-17/part-r-00000 11,2014-09-17,base,base,base 12,2014-09-17,base,CHANGED,base 13,2014-09-17,base,base,base
  • 31. Page31 Loop Step: Delete the Partition ALTER TABLE bogus_info DROP IF EXISTS PARTITION (date_created='2014-09-17');
  • 32. Page32 Loop Step: Recreate the Partition to_load = LOAD '/user/fred/temp_$partition_key' USING PigStorage(',') AS ( bogus_id:int, date_created:chararray, field_one:chararray, field_two:chararray, field_three:chararray ); STORE to_load INTO 'bogus_info' USING org.apache.hcatalog.pig.HCatStorer();
  • 33. Page33 Verify the Loop Step Updates select * from bogus_info where date_created = '2014-09-17'; 11 base base base 2014-09-17 12 base CHANGED base 2014-09-17 13 base base base 2014-09-17
  • 34. Page34 My Head Hurts, Too! As Promised, We Flew Through That – Take Another Look Later
  • 35. Page35 What Does Merge & Replace Miss? If critical, you have options •Create a delete table sourced by a trigger •At some wide frequency, start all over with a Full Refresh Fortunately, ~most~ enterprises don’t delete anything Marking items “inactive” is popular
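  If deletes do have to be honored, one hedged way to apply a trigger-fed delete table during the merge is an anti-join at Compact time (delete_log and deleted_id are assumed names, and reconcile_view refers to the reconcile sketch above, not to anything in the deck):

      hive -e "
      -- Keep only reconciled rows whose keys do NOT appear in the delete log
      DROP TABLE IF EXISTS reporting_table;
      CREATE TABLE reporting_table AS
      SELECT v.*
      FROM reconcile_view v
      LEFT OUTER JOIN delete_log d
        ON v.bogus_id = d.deleted_id
      WHERE d.deleted_id IS NULL;
      "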
  • 36. Page36 Hybrid: Partition-Level Refresh If most of the partition is modified, just replace it entirely Especially if the changes are only recent (or highly skewed) Use a configured number of partitions to refresh and assume the rest of the data is static
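  For that hybrid, the per-partition work can collapse to a single overwrite; a sketch, where full_ingest_staging is an assumed staging table holding a fresh pull of the affected partition:

      hive -e "
      -- Replace the whole 2014-09-20 partition with freshly ingested content
      INSERT OVERWRITE TABLE bogus_info PARTITION (date_created='2014-09-20')
      SELECT bogus_id, field_one, field_two, field_three
      FROM full_ingest_staging
      WHERE date_created='2014-09-20';
      "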
  • 37. Page37 Active Archive Strategy Review
     strategy                        | # of rows   | % of chg | chg skew | handles deletes | complexity
     Full Refresh                    | <= millions | any      | any      | yes             | simple
     Full Merge & Replace            | <= millions | any      | any      | no              | moderate
     Partition-Level Merge & Replace | billions +  | < 5%     | < 5%     | no              | complex
     Partition-Level Refresh         | billions +  | < 5%     | < 5%     | yes             | complex
  • 38. Page38 Isn’t There Anything Easier? HIVE-5317 brought us Insert, Update & Delete •Alan Gates presented Monday •More tightly-coupled w/o the same “hazard windows” •“Driver” logic shifts to be delta-only & row-focused Thoughts & attempts at true DB replication •Some COTS solutions have been tried •Ideally an open-source alternative, such as an enhanced Streaming Data Ingest framework
  • 39. Page39 Considerations for HIVE-5317 On performance & scalability, your mileage may vary Does NOT make Hive a RDBMS Available in Hive 0.14 onwards DDL requirements •Must utilize partitioning & bucketing •Initially, only supports ORC
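  For orientation only, a sketch of what the HIVE-5317 route looks like; this assumes Hive 0.14+ with ACID transactions enabled (transaction manager, compactor, and concurrency settings configured), and reuses the deck's example table name with an _acid suffix:

      hive -e "
      CREATE TABLE bogus_info_acid (
        bogus_id int, field_one string, field_two string, field_three string)
      PARTITIONED BY (date_created string)
      CLUSTERED BY (bogus_id) INTO 8 BUCKETS
      STORED AS ORC
      TBLPROPERTIES ('transactional'='true');

      -- Deltas become plain row-level DML instead of merge & replace
      UPDATE bogus_info_acid SET field_two = 'CHANGED' WHERE bogus_id = 12;
      DELETE FROM bogus_info_acid WHERE bogus_id = 19;
      "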
  • 40. Page40 Recommendations Take another look at this topic once back at “your desk” As with all things Hadoop… •Know your data & workloads •Try several approaches & evaluate results in earnest •Stick with the KISS principle whenever possible Share your findings via blogs and local user groups Expect (even more!) great things from Hive
  • 41. Page41 Questions? Lester Martin – Hortonworks Professional Services lmartin@hortonworks.com || lester.martin@gmail.com http://lester.website (links to blog, twitter, github, LI, FB, etc) THANKS FOR YOUR TIME!!

Editor's Notes

  1. This diagram shows the typical use case where the deltas represent a small percentage of the existing data as well as the addition of some records that are only present in the delta dataset. In practice, the differences in size of these two circles would be much more pronounced as the existing data includes historical records that have not been modified in a long time and most likely will not be modified again.
  2. If no last-updated timestamp is present, an ALTER TABLE could add such a column; a DEFAULT of the current timestamp would populate it for new records, and an “ON UPDATE” trigger could keep it current for changed records.
  3. You can use tools other than Hive, such as Pig, to perform these operations. A Pig approach could be: ingest the data (as above); in a single Pig script, read the old & new data, merge them, and load the result into a “resolved” table; then drop the old table, rename the new table, and recreate the “resolved” table (for the next run).
  4. When there is a small number of actual changes compared with a large number of unchanged records, the incremental data ingest step limits the amount of data that needs to be transferred across the network as well as the amount of raw data that needs to be persisted in HDFS. There will be a point when merging a single delta file with a much larger existing file takes longer than just getting a full copy, or at least when the merge & replace processing becomes too lengthy to be useful. An example of this could be a model where the existing data is around 100 billion rows, but there are only 100,000 delta records.
  5. The classic 80/20 rule applies (maybe even 95/5) for tables that don’t/do need partitioning to benefit from the merge & replace strategy. In fact, nothing would prevent a table that uses Full Refresh or comprehensive Merge & Replace from having partitions.
  6. Bad examples would include a building identifier, or the city or zip code that buildings are located within. We could get a wide spread of the data, but delta records would likely cover most, if not all, of the partitions.
  7. The goal is to break down the delta processing at the partition level and to be able to focus only on a subset of the overall partitions. In many ways, the processing is much like the basic merge & replace approach, except that several much smaller iterations occur while leaving the vast majority of the partitions alone.
  8. We’ll go VERY FAST through this during the presentation, but will be useful to review later.
  9. 2014-09-17 2014-09-18 2014-09-20
  10. Here it is in its entirety… (simple db metadata reading & template builders could be created to automate this gnarly big FOREACH statement) combined_part = FOREACH partJ GENERATE ((deltaP::bogus_id is not null) ? deltaP::bogus_id: tblP::bogus_id) as bogus_id, ((deltaP::date_created is not null) ? deltaP::date_created: tblP::date_created) as date_created, ((deltaP::field_one is not null) ? deltaP::field_one: tblP::field_one) as field_one, ((deltaP::field_two is not null) ? deltaP::field_two: tblP::field_two) as field_two, ((deltaP::field_three is not null) ? deltaP::field_three: tblP::field_three) as field_three;
  11. We’ll go VERY FAST through this during the presentation, but will be useful to review later. Also, this is all rather generalizable so individual projects can build a simple framework to drive via metadata.
  12. Not a calculator, rule, or formula – just to drive the conversation about where each strategy might work best
  13. COTS solutions from vendors like Oracle, Dell & SAP, but none that I have evaluated
  14. Not picking on the CRUD operations’ performance & scalability opportunities; just pointing out that many variables are at play which could make things better or worse