August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector

Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an open source platform for building big data ingest pipelines that allows you to design, execute and monitor robust data flows. In this session we'll look at how SDC's "intent-driven" approach keeps the data flowing, whether you're processing data 'off-cluster', in Spark, or in MapReduce.

StreamSets software delivers performance management for data flows that feed the next generation of big data applications. Its mission is to bring operational excellence to the management of data in motion, so that data arrives on time and with quality, accelerating analysis and decision making. StreamSets Data Collector is in use at hundreds of companies where it brings unprecedented visibility into and control over data as it moves between an expanding variety of sources and destinations.

Speakers:
Pat Patterson has been working with Internet technologies since 1997, building software and working with communities at Sun Microsystems, Huawei, Salesforce and StreamSets. At Sun, Pat was the community lead for the OpenSSO open source project, while at Huawei he developed cloud storage infrastructure software. Part of the developer evangelism team at Salesforce, Pat focused on identity, integration and the Internet of Things. Now community champion at StreamSets, Pat is responsible for the care and feeding of the StreamSets open source community.

Technology

Open Source Big Data Ingest with
StreamSets Data Collector
Pat Patterson
Community Champion
@metadaddy
pat@streamsets.com

Company Background
Founders
Traditional and Big Data
Top tier Investors
Strategic Partners
Momentum to Date
Launched 2014; exited stealth 9/15
~30 employees
Double-digit enterprise customers
10,000 downloads

Market Trends
Stages: Data Sources, Data Stores, Data Consumers
Past: ETL, ETL
Emerging: Ingest, Analyze

Data Drift
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Structure Drift
Semantic Drift
Infrastructure Drift

Solving Data Drift
Data Sources
Data Stores
Data Consumers
Tools
Applications
Custom code
Fixed-schema
Data Drift
Poor Data Quality
Delayed and False Insights

Solving Data Drift
Data Sources
Data Stores
Data Consumers
Tools
Applications
Intent-Driven
Drift-Handling
Data Drift
Data KPIs
Trusted Insights
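One way to read the contrast between fixed-schema tooling and an intent-driven approach is the difference between reading a value by its position and asking for it by name. The sketch below is an illustration only and is not SDC code; the column names (`id`, `region`, `amount`), the sample feeds and both helper functions are hypothetical.

```python
import csv
import io


def amounts_by_position(text, index=1):
    """Fixed-schema style: assume the amount is always in column `index`."""
    rows = list(csv.reader(io.StringIO(text)))
    return [row[index] for row in rows[1:]]  # skip the header row


def amounts_by_name(text, field="amount"):
    """Intent-driven style: ask for the field by name, wherever it sits."""
    return [row[field] for row in csv.DictReader(io.StringIO(text))]


# Hypothetical feed before and after an upstream team adds a column.
ORIGINAL = "id,amount\n1,9.50\n2,12.00\n"
DRIFTED = "id,region,amount\n1,west,9.50\n2,east,12.00\n"
```

On the drifted feed the positional reader does not fail; it quietly returns the region values instead of amounts, which is how drift can turn into false insights downstream. The name-based reader returns the same amounts for both feeds.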

Example: Data Loss and Corrosion
SQL on Hadoop (Hive)
Y/Y Click Through Rate
80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis

StreamSets Data Collector
Open source software for the rapid development and reliable operation of complex data flows.
➢ Efficiency
➢ Control
➢ Agility

SDC Demo
StreamSets Data Collector
Apache Kafka
Apache Kudu

Upcoming Events
SF Bay Area Data Ingest Meetup - Aug 25, Palo Alto, CA
MapR Big Data Everywhere - Aug 30, San Francisco, CA
Strata + Hadoop World - Sep 27-29, New York, NY

Structure Drift
Data structures and formats evolve and change unexpectedly
Implication: Data Loss, Data Squandering
Delimited Data
107.3.137.195 fe80::21b:21ff:fe83:90fa
Attribute Format Changes
{
"first": "jon",
"last": "smith",
"email": "jsmith@acme.com",
"add1": "123 Washington",
"add2": "",
"city": "Tucson",
"state": "AZ",
"zip": "85756"
}
{
"first": "jane",
"last": "smith",
"email": "jane@earth.net",
"add1": "456 Fillmore",
"add2": "Apt 120",
"city": "Fairfield",
"state": "VA",
"zip": "24435-1001",
"phone": "401-555-1212"
}
Data Structure Evolution
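The two structure-drift examples on this slide (an address field that switches from IPv4 to IPv6, and a customer record that gains a `phone` field and a ZIP+4 code) can be absorbed by code that tolerates both shapes. The following is a minimal sketch under that assumption, not SDC's implementation; the function names and the output keys `zip5` and `zip_plus4` are hypothetical.

```python
import ipaddress


def ip_version(text):
    """Return 4 or 6, so one code path accepts either address format."""
    return ipaddress.ip_address(text).version


def normalize_record(record):
    """Map either record shape from the slide onto one stable shape."""
    zip5, _, plus4 = record.get("zip", "").partition("-")
    return {
        "first": record.get("first"),
        "last": record.get("last"),
        "email": record.get("email"),
        "zip5": zip5,
        "zip_plus4": plus4 or None,    # only present for ZIP+4 values
        "phone": record.get("phone"),  # absent in the older record shape
    }
```

Using `dict.get` rather than indexing means a newly added or still-missing field yields `None` instead of raising, so older and newer records can flow through the same step.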

Semantic Drift
Data semantics change with evolving applications
Implication: Data Corrosion, Data Loss
24122-52172 00-24122-52172
Account Number Expansion
M134: user {jsmith} read access granted {ac:24122-52172}
M134: user {jsmith} read access granted {ca.ac:24122-52172}
Namespace Qualification
…
…,3588310669797950,$91.41,jcb,K1088-W#9,…
…,6759006011936944,$155.04,switch,A6504-Y#9,…
…,6771111111151415,$37.78,laser,Q9936-T#9,…
…,3585905063294299,$164.48,jcb,S4643-H#9,…
…,5363527828638736,$117.52,mastercard,X3286-P#9,…
…,4903080150282806,$168.03,switch,I9133-W#3,…
…
Outlier / Anomaly Detection
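The account-number and namespace examples above can be handled by a parser that accepts both the old and new forms and emits one canonical value. This is a sketch only: the regular expression is derived from the two log lines on the slide, and the choice to treat a two-part account number as belonging to prefix `00` is an assumption made for illustration, as are the function names.

```python
import re

# Matches both "{ac:...}" and the namespace-qualified "{ca.ac:...}" form.
ACCESS_PATTERN = re.compile(
    r"user \{(?P<user>[^}]+)\} read access granted "
    r"\{(?:(?P<namespace>[\w.]+)\.)?ac:(?P<account>[\d-]+)\}"
)


def canonical_account(account, default_prefix="00"):
    """Expand a two-part account number to the three-part form.

    Assumption: legacy numbers map to `default_prefix`.
    """
    parts = account.split("-")
    if len(parts) == 2:
        parts.insert(0, default_prefix)
    return "-".join(parts)


def parse_access(line):
    """Return user, namespace (or None) and canonical account, or None."""
    match = ACCESS_PATTERN.search(line)
    if match is None:
        return None
    return {
        "user": match.group("user"),
        "namespace": match.group("namespace"),
        "account": canonical_account(match.group("account")),
    }
```

Returning `None` for lines that match neither form keeps unexpected variants visible, so they can be routed for inspection rather than being silently misread.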

Infrastructure Drift
Physical and Logical Infrastructure changes rapidly
Implication: Poor Agility, Operational Downtime
Data Center 1
Data Center 2
Data Center n
3rd Party Service Provider
App a
App k
App q
Cloud Infrastructure
