SlideShare a Scribd company logo
1 of 21
Open Source Big Data Ingestion
Without the Heartburn!
Pat Patterson
Community Champion
@metadaddy
pat@streamsets.com
The Ingest Problem
Apache Flume
Apache Sqoop
Apache Nifi
StreamSets Data Collector
Demo
Agenda
Volume
Variety
Velocity
Veracity
Big Data Ingest
Free
Like a puppy
Difficulty
Fragility
Maintenance
Base Case - Custom Code
Originated at Cloudera
Inspired by Facebook Scribe - open source log
aggregation
Decentralized, distributed system of independent
agents
‘Off-cluster’ only
Opaque, record-oriented payload - byte arrays
Apache Flume
Apache Flume
Flume Agent
Source
Sink
Channel
Incoming
Data
Outgoing
Data
Interceptor
● Modify/drop events
in-flight
Sink
● Removes data from
Channel
● Sends data to
downstream Agent or
Destination
Channel
● Stores data in the
order received
Interceptor
Source
● Accepts incoming
Data
● Scales as required
● Writes data to
Channel
Apache Flume
Flume Agent
Flume Agent
Flume Agent
Works well for managing impedance mismatches between source and sink -
smooth out spikes in load
Log
HDFS
Apache Flume
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Combinatorial explosion of agents with tasks and record formats
Contextual processing is hard
Configuration validation is hard
No overall view of the system
Version 1.0 released Jan 2012
Latest version (1.6) released May 2015
Apache Flume
Originated at Cloudera
Transfer bulk data between RDBMS and Hadoop
Command-line tool
Breaks a table/query into ‘n’ partitions
‘On-cluster’ - runs as a ‘map-only’ job in MapReduce
‘High-Speed Connectors’ can take advantage of low-level
database features - Teradata, Exadata, Netezza etc
Apache Sqoop
Apache Sqoop
$ sqoop import-all-tables 
-m {{cluster_data.worker_node_hostname.length}} 
--connect jdbc:mysql://{{cluster_data.manager_node_hostname}}:3306/retail_db 
--username=retail_dba 
--password=cloudera 
--compression-codec=snappy 
--as-parquetfile 
--warehouse-dir=/user/hive/warehouse 
--hive-import
Batch mode only
Database credentials on command line, or shipped around in MapReduce config
Version 1.0.0 released June 2010
Latest version (1.4.6) released April 2015
Sqoop 2 proposed in SQOOP-365, Oct 2011
Latest (1.99.6) released May 2015, still ‘not intended for production deployment’
Apache Sqoop
Originated at NSA as Niagarafiles
Open sourced December 2014, Apache TLP July 2015
Opaque, file-oriented payload
Distributed system of processors with centralized control
Based on flow-based programming concepts
Data Provenance
Web-based user interface
Apache NiFi
Apache NiFi
Apache NiFi
Opaque files -> same combinatorial explosion as Flume
ConvertAvroToJSON, ConvertCSVToAvro,
ConvertJSONToAvro, ConvertJSONToSQL
Not really big data native
Breaks principle of data locality
Operates as own cluster
Founded by ex-Cloudera, Informatica employees
Continuous open source, intent-driven, big data ingest
Visible, record-oriented approach fixes combinatorial explosion
Batch or stream processing
Standalone, Spark cluster, MapReduce cluster
IDE for pipeline development by ‘civilians’
SDK for custom components (origin/processor/destination)
StreamSets Data Collector
StreamSets Data Collector
StreamSets Data Collector
Relatively new - first public release September 2015
So far, vast majority of commits are from StreamSets staff
SDC Demo
Apache Kafka
↘
StreamSets
Data Collector
↘
Apache Kudu
Flume - good for smoothing out impedance mismatches, but complex to deploy and maintain
Sqoop - good for database dumps, but not enterprise-friendly
Nifi - good for file-oriented flows, but not really big-data oriented
StreamSets Data Collector - good for continuous ingest pipelines, but relative newcomer
Conclusion
Thank You!
Pat Patterson
Community Champion
@metadaddy
pat@streamsets.com

More Related Content

What's hot

Membase Meetup 2010
Membase Meetup 2010Membase Meetup 2010
Membase Meetup 2010
Membase
 

What's hot (20)

Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
 
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
 
Membase Meetup 2010
Membase Meetup 2010Membase Meetup 2010
Membase Meetup 2010
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with Rocana
 
Streamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User GroupStreamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User Group
 
Superset druid realtime
Superset druid realtimeSuperset druid realtime
Superset druid realtime
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
 
Automatic Scaling Iterative Computations
Automatic Scaling Iterative ComputationsAutomatic Scaling Iterative Computations
Automatic Scaling Iterative Computations
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
 
Visualizing big data in the browser using spark
Visualizing big data in the browser using sparkVisualizing big data in the browser using spark
Visualizing big data in the browser using spark
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
 
Open source data ingestion
Open source data ingestionOpen source data ingestion
Open source data ingestion
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 

Viewers also liked

Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoop
skaluska
 
Basic data ingestion in r
Basic data ingestion in rBasic data ingestion in r
Basic data ingestion in r
Jacob Rideout
 
Barga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 KeynoteBarga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 Keynote
Roger Barga
 
Informatica object migration
Informatica object migrationInformatica object migration
Informatica object migration
Amit Sharma
 

Viewers also liked (20)

Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion Pipelines
 
Logging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data CollectorLogging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data Collector
 
Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoop
 
Adaptive Data Cleansing with StreamSets and Cassandra
Adaptive Data Cleansing with StreamSets and CassandraAdaptive Data Cleansing with StreamSets and Cassandra
Adaptive Data Cleansing with StreamSets and Cassandra
 
Real time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid CloudReal time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid Cloud
 
Jitney, Kafka at Airbnb
Jitney, Kafka at AirbnbJitney, Kafka at Airbnb
Jitney, Kafka at Airbnb
 
High Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopHigh Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into Hadoop
 
Reliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at AirbnbReliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at Airbnb
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexas
 
Hadoop data ingestion
Hadoop data ingestionHadoop data ingestion
Hadoop data ingestion
 
Basic data ingestion in r
Basic data ingestion in rBasic data ingestion in r
Basic data ingestion in r
 
Barga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 KeynoteBarga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 Keynote
 
Informatica object migration
Informatica object migrationInformatica object migration
Informatica object migration
 
Bad Data is Polluting Big Data
Bad Data is Polluting Big DataBad Data is Polluting Big Data
Bad Data is Polluting Big Data
 
Building Custom Big Data Integrations
Building Custom Big Data IntegrationsBuilding Custom Big Data Integrations
Building Custom Big Data Integrations
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
 
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
 

Similar to Open Source Big Data Ingestion - Without the Heartburn!

ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
Timothy Spann
 
OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming Apps
Timothy Spann
 

Similar to Open Source Big Data Ingestion - Without the Heartburn! (20)

ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
 
CODEONTHEBEACH_Streaming Applications with Apache Pulsar
CODEONTHEBEACH_Streaming Applications with Apache PulsarCODEONTHEBEACH_Streaming Applications with Apache Pulsar
CODEONTHEBEACH_Streaming Applications with Apache Pulsar
 
Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar
 
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU:  Deep Dive into Building Streaming Applications with Apache PulsarOSS EU:  Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solr
 
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark StreamingReal-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache Apex
 
IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub
IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing HubIMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub
IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub
 
Building Modern Data Streaming Apps with Python
Building Modern Data Streaming Apps with PythonBuilding Modern Data Streaming Apps with Python
Building Modern Data Streaming Apps with Python
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
All Day DevOps - FLiP Stack for Cloud Data Lakes
All Day DevOps - FLiP Stack for Cloud Data LakesAll Day DevOps - FLiP Stack for Cloud Data Lakes
All Day DevOps - FLiP Stack for Cloud Data Lakes
 
Monitoring and Resiliency Testing our Apache Kafka Clusters at Goldman Sachs ...
Monitoring and Resiliency Testing our Apache Kafka Clusters at Goldman Sachs ...Monitoring and Resiliency Testing our Apache Kafka Clusters at Goldman Sachs ...
Monitoring and Resiliency Testing our Apache Kafka Clusters at Goldman Sachs ...
 
DBCC 2021 - FLiP Stack for Cloud Data Lakes
DBCC 2021 - FLiP Stack for Cloud Data LakesDBCC 2021 - FLiP Stack for Cloud Data Lakes
DBCC 2021 - FLiP Stack for Cloud Data Lakes
 
Data science online camp using the flipn stack for edge ai (flink, nifi, pu...
Data science online camp   using the flipn stack for edge ai (flink, nifi, pu...Data science online camp   using the flipn stack for edge ai (flink, nifi, pu...
Data science online camp using the flipn stack for edge ai (flink, nifi, pu...
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
 
OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming Apps
 

More from Pat Patterson

Salesforce Integration with Twilio
Salesforce Integration with TwilioSalesforce Integration with Twilio
Salesforce Integration with Twilio
Pat Patterson
 

More from Pat Patterson (20)

DevOps from the Provider Perspective
DevOps from the Provider PerspectiveDevOps from the Provider Perspective
DevOps from the Provider Perspective
 
How Imprivata Combines External Data Sources for Business Insights
How Imprivata Combines External Data Sources for Business InsightsHow Imprivata Combines External Data Sources for Business Insights
How Imprivata Combines External Data Sources for Business Insights
 
Data Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, HowData Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, How
 
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
 
Dealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data LakeDealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data Lake
 
Integrating with Einstein Analytics
Integrating with Einstein AnalyticsIntegrating with Einstein Analytics
Integrating with Einstein Analytics
 
Efficient Schemas in Motion with Kafka and Schema Registry
Efficient Schemas in Motion with Kafka and Schema RegistryEfficient Schemas in Motion with Kafka and Schema Registry
Efficient Schemas in Motion with Kafka and Schema Registry
 
Dealing With Drift - Building an Enterprise Data Lake
Dealing With Drift - Building an Enterprise Data LakeDealing With Drift - Building an Enterprise Data Lake
Dealing With Drift - Building an Enterprise Data Lake
 
All Aboard the Boxcar! Going Beyond the Basics of REST
All Aboard the Boxcar! Going Beyond the Basics of RESTAll Aboard the Boxcar! Going Beyond the Basics of REST
All Aboard the Boxcar! Going Beyond the Basics of REST
 
Provisioning IDaaS - Using SCIM to Enable Cloud Identity
Provisioning IDaaS - Using SCIM to Enable Cloud IdentityProvisioning IDaaS - Using SCIM to Enable Cloud Identity
Provisioning IDaaS - Using SCIM to Enable Cloud Identity
 
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
 
Enterprise IoT: Data in Context
Enterprise IoT: Data in ContextEnterprise IoT: Data in Context
Enterprise IoT: Data in Context
 
OData: A Standard API for Data Access
OData: A Standard API for Data AccessOData: A Standard API for Data Access
OData: A Standard API for Data Access
 
API-Driven Relationships: Building The Trans-Internet Express of the Future
API-Driven Relationships: Building The Trans-Internet Express of the FutureAPI-Driven Relationships: Building The Trans-Internet Express of the Future
API-Driven Relationships: Building The Trans-Internet Express of the Future
 
Using Salesforce to Manage Your Developer Community
Using Salesforce to Manage Your Developer CommunityUsing Salesforce to Manage Your Developer Community
Using Salesforce to Manage Your Developer Community
 
Identity in the Cloud
Identity in the CloudIdentity in the Cloud
Identity in the Cloud
 
OpenID Connect: An Overview
OpenID Connect: An OverviewOpenID Connect: An Overview
OpenID Connect: An Overview
 
How I Learned to Stop Worrying and Love Open Source Identity (Paris Edition)
How I Learned to Stop Worrying and Love Open Source Identity (Paris Edition)How I Learned to Stop Worrying and Love Open Source Identity (Paris Edition)
How I Learned to Stop Worrying and Love Open Source Identity (Paris Edition)
 
Salesforce Integration with Twilio
Salesforce Integration with TwilioSalesforce Integration with Twilio
Salesforce Integration with Twilio
 
SAML Smackdown
SAML SmackdownSAML Smackdown
SAML Smackdown
 

Recently uploaded

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 

Recently uploaded (20)

MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaS
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - Keynote
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 

Open Source Big Data Ingestion - Without the Heartburn!

  • 1. Open Source Big Data Ingestion Without the Heartburn! Pat Patterson Community Champion @metadaddy pat@streamsets.com
  • 2. The Ingest Problem Apache Flume Apache Sqoop Apache Nifi StreamSets Data Collector Demo Agenda
  • 5. Originated at Cloudera Inspired by Facebook Scribe - open source log aggregation Decentralized, distributed system of independent agents ‘Off-cluster’ only Opaque, record-oriented payload - byte arrays Apache Flume
  • 6. Apache Flume Flume Agent Source Sink Channel Incoming Data Outgoing Data Interceptor ● Modify/drop events in-flight Sink ● Removes data from Channel ● Sends data to downstream Agent or Destination Channel ● Stores data in the order received Interceptor Source ● Accepts incoming Data ● Scales as required ● Writes data to Channel
  • 7. Apache Flume Flume Agent Flume Agent Flume Agent Works well for managing impedance mismatches between source and sink - smooth out spikes in load Log HDFS
  • 8. Apache Flume # example.conf: A single-node Flume configuration # Name the components on this agent a1.sources = r1 a1.sinks = k1 a1.channels = c1 # Describe/configure the source a1.sources.r1.type = netcat a1.sources.r1.bind = localhost a1.sources.r1.port = 44444 # Describe the sink a1.sinks.k1.type = logger # Use a channel which buffers events in memory a1.channels.c1.type = memory a1.channels.c1.capacity = 1000 a1.channels.c1.transactionCapacity = 100 # Bind the source and sink to the channel a1.sources.r1.channels = c1 a1.sinks.k1.channel = c1
  • 9. Combinatorial explosion of agents with tasks and record formats Contextual processing is hard Configuration validation is hard No overall view of the system Version 1.0 released Jan 2012 Latest version (1.6) released May 2015 Apache Flume
  • 10. Originated at Cloudera Transfer bulk data between RDBMS and Hadoop Command-line tool Breaks a table/query into ‘n’ partitions ‘On-cluster’ - runs as a ‘map-only’ job in MapReduce ‘High-Speed Connectors’ can take advantage of low-level database features - Teradata, Exadata, Netezza etc Apache Sqoop
  • 11. Apache Sqoop $ sqoop import-all-tables -m {{cluster_data.worker_node_hostname.length}} --connect jdbc:mysql://{{cluster_data.manager_node_hostname}}:3306/retail_db --username=retail_dba --password=cloudera --compression-codec=snappy --as-parquetfile --warehouse-dir=/user/hive/warehouse --hive-import
  • 12. Batch mode only Database credentials on command line, or shipped around in MapReduce config Version 1.0.0 released June 2010 Latest version (1.4.6) released April 2015 Sqoop 2 proposed in SQOOP-365, Oct 2011 Latest (1.99.6) released May 2015, still ‘not intended for production deployment’ Apache Sqoop
  • 13. Originated at NSA as Niagarafiles Open sourced December 2014, Apache TLP July 2015 Opaque, file-oriented payload Distributed system of processors with centralized control Based on flow-based programming concepts Data Provenance Web-based user interface Apache NiFi
  • 15. Apache NiFi Opaque files -> same combinatorial explosion as Flume ConvertAvroToJSON, ConvertCSVToAvro, ConvertJSONToAvro, ConvertJSONToSQL Not really big data native Breaks principle of data locality Operates as own cluster
  • 16. Founded by ex-Cloudera, Informatica employees Continuous open source, intent-driven, big data ingest Visible, record-oriented approach fixes combinatorial explosion Batch or stream processing Standalone, Spark cluster, MapReduce cluster IDE for pipeline development by ‘civilians’ SDK for custom components (origin/processor/destination) StreamSets Data Collector
  • 18. StreamSets Data Collector Relatively new - first public release September 2015 So far, vast majority of commits are from StreamSets staff
  • 19. SDC Demo Apache Kafka ↘ StreamSets Data Collector ↘ Apache Kudu
  • 20. Flume - good for smoothing out impedance mismatches, but complex to deploy and maintain Sqoop - good for database dumps, but not enterprise-friendly Nifi - good for file-oriented flows, but not really big-data oriented StreamSets Data Collector - good for continuous ingest pipelines, but relative newcomer Conclusion
  • 21. Thank You! Pat Patterson Community Champion @metadaddy pat@streamsets.com