Dealing with Unstructured Data: Scaling to Infinity
Image: Boykung/Shutterstock
Image: John Hammink
There are many sources of information
Copyright ©2014 Treasure Data. All Rights Reserved.
Big Data Simplified: One Approach
[Diagram]
App servers, mobile SDKs, and the web SDK emit multi-structured events:
• register
• login
• start_event
• purchase
• etc
Collection: embedded SDKs and server-side agents
Storage: infinite & economical cloud data store
✓ App log data
✓ Mobile event data
✓ Sensor data
✓ Telemetry
Access: SQL (familiar & table-oriented)
• SQL-based ad-hoc queries
• SQL-based dashboards
Results push to DBs & data marts and other apps
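One way to picture how multi-structured events can end up in a familiar, table-oriented form is sketched below. This is only an illustration under stated assumptions: the JSON field names (`user`, `plan`, `device`, `item`, `price`) and the `to_table` helper are hypothetical, and nothing here reflects how the Treasure Data service is actually implemented.

```python
import json

# Hypothetical events: each event type carries its own set of fields.
RAW_EVENTS = [
    '{"event": "register", "user": "u1", "plan": "free"}',
    '{"event": "login", "user": "u1", "device": "ios"}',
    '{"event": "purchase", "user": "u1", "item": "sword", "price": 4.99}',
]


def to_table(lines):
    """Parse JSON events and return (columns, rows).

    Columns are the union of all keys in first-seen order; a field that an
    event does not carry becomes None in that event's row.
    """
    events = [json.loads(line) for line in lines]
    columns = []
    for ev in events:
        for key in ev:
            if key not in columns:
                columns.append(key)
    rows = [tuple(ev.get(col) for col in columns) for ev in events]
    return columns, rows


columns, rows = to_table(RAW_EVENTS)
for row in rows:
    print(dict(zip(columns, row)))
```

The point of the sketch is that no schema is declared up front: new fields simply widen the table, which is what makes collection easy.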
What is the point of all this data?
BI: Business Intelligence
Using Very Large Sets of Data
Copyright ©2015 Treasure Data. All Rights Reserved.
Treasure Data By the Numbers (Jan-2015):
• 13T+ records of data imported since launch
• 500K+ records imported each second
• 1.5 Trillion+ records imported each month
• 12B records sent per day by one customer
[Chart: Data Records Stored in the Treasure Data Cloud Service, Aug-12 to Dec-14 (last 2 years), y-axis from 0 to 14 trillion records. Annotations: Service Launched; Series A Funding; 100 Customers; Selected by Gartner as Cool Vendor in Big Data; 5 Trillion Records; 10 Trillion Records; 13 Trillion Records; Series B Funding.]
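As a rough sanity check, the per-second, per-day, and per-month figures on this slide can be related with simple arithmetic. The sketch below assumes a 30-day month and a constant rate, neither of which is stated on the slide; the function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_month(records_per_second, days=30):
    """Records accumulated over `days` at a constant per-second rate."""
    return records_per_second * SECONDS_PER_DAY * days


def per_second(records_per_day):
    """Average per-second rate implied by a daily total."""
    return records_per_day / SECONDS_PER_DAY


# 500K records/second sustained for an assumed 30-day month.
monthly_at_500k = per_month(500_000)

# Average rate needed to reach 1.5 trillion records in the same month.
rate_for_1_5_trillion = 1.5e12 / (SECONDS_PER_DAY * 30)

# Average rate of the single customer sending 12B records per day.
one_customer_rate = per_second(12e9)

print(f"{monthly_at_500k:,} records/month at 500K/s")
print(f"{rate_for_1_5_trillion:,.0f} records/s for 1.5T/month")
print(f"{one_customer_rate:,.0f} records/s from one customer")
```

Under these assumptions, 500K records per second gives about 1.3 trillion records per month, so "1.5 Trillion+" per month corresponds to an average a little under 580K per second, which is consistent with the "500K+" figure.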
Statistics
• Total records stored: 25 trillion
• Managed & supported: 24 * 7 * 365
• Uptime: 99.99%
• New records / second: 1 million
• 100x daily Twitter volume
A solution?
• There are trade-offs to consider
• Any trade-off should still make it easy to collect data
• Easy does it: un- and semi-structured data (multi-structured data)
• Open source means it’s free; it also means you need someone on hand to implement and maintain it
• Cloud storage means you don’t have to scale and/or shard; the trade-off is a performance hit compared with bare metal
Image: John Hammink
Image: Dreamstime
Images: Lightspring/Shutterstock, John Hammink, Treasure Data
There are a few intro-to-data-science posts at blog.treasuredata.com!
What does a pipeline need?
Open vs. Closed source
Image: Heather Craig/Shutterstock
Images: PC World, Data-Hive, Wallpapersmela
or
or
?
LAMBDA ARCHITECTURE
# logs from a file
<source>
  type tail
  path /var/log/httpd.log
  format apache2
  tag web.access
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to ES and HDFS
<match *.*>
  type copy
  <store>
    type elasticsearch
    logstash_format
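The routing idea in this configuration (events carry a tag such as web.access, a match pattern selects them, and a copy output fans each event out to several stores) can be sketched in a few lines. This is a simplified model and not Fluentd's implementation: `tag_matches`, `CopyOutput`, and `route` are hypothetical names, the two stores are in-memory lists standing in for Elasticsearch and HDFS, and only the single-part `*` wildcard is modelled.

```python
def tag_matches(pattern, tag):
    """Return True if the tag matches the pattern part by part.

    Tags and patterns are dot-separated; '*' matches exactly one part,
    so '*.*' matches 'web.access' but not 'web' or 'a.b.c'.
    """
    p_parts = pattern.split(".")
    t_parts = tag.split(".")
    if len(p_parts) != len(t_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(p_parts, t_parts))


class CopyOutput:
    """Fan each event out to every store, like a copy output with several stores."""

    def __init__(self, stores):
        self.stores = stores

    def emit(self, tag, record):
        for store in self.stores:
            store.append((tag, dict(record)))


# In-memory stand-ins for the two destinations named on the slide.
es_store, hdfs_store = [], []
routes = [("*.*", CopyOutput([es_store, hdfs_store]))]


def route(tag, record):
    """Send the event to the first output whose pattern matches its tag."""
    for pattern, output in routes:
        if tag_matches(pattern, tag):
            output.emit(tag, record)
            return True
    return False


routed = route("web.access", {"path": "/", "code": 200})
unrouted = route("web", {"path": "/", "code": 200})
print(routed, unrouted, len(es_store), len(hdfs_store))
```

Keeping sources, matching, and outputs separate is what lets one collector feed several back ends without the applications knowing about any of them.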
LESS SIMPLE FORWARDING
Before fluentd
Multi-structured data
• Unstructured data: better for data ultimately used in statistics
fluentd!
http://www.fluentd.org/
http://msgpack.org/
Embulk: an open-source bulk data loader that helps transfer data between various databases, storage systems, file formats, and cloud services
embulk.org/docs
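To make "bulk data loader" concrete, here is a minimal sketch of the general pattern: read records from one format and write them to a database in batches. It is not Embulk and does not use Embulk's configuration or plugins; the `bulk_load` function, the sample CSV, and the `events` table name are all assumptions made for illustration, using only Python's standard library.

```python
import csv
import io
import sqlite3


def bulk_load(csv_text, conn, table, batch_size=2):
    """Load CSV text into an SQLite table in batches; return rows loaded."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    # Untyped columns keep the sketch schema-free; SQLite allows this.
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    insert_sql = f'INSERT INTO "{table}" ({cols}) VALUES ({marks})'

    loaded, batch = 0, []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(insert_sql, batch)
            loaded += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany(insert_sql, batch)
        loaded += len(batch)
    conn.commit()
    return loaded


# Hypothetical input data.
SAMPLE = "user,event,price\nu1,login,\nu2,purchase,4.99\nu3,login,\n"

conn = sqlite3.connect(":memory:")
loaded = bulk_load(SAMPLE, conn, "events")
count = conn.execute('SELECT COUNT(*) FROM "events"').fetchone()[0]
print(loaded, count)
```

Batching the inserts is the main design choice: it bounds memory use and reduces per-row overhead, which matters once the input no longer fits comfortably in memory.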
Hivemall
Hivemall is a scalable machine learning library that runs on Apache Hive.
Hivemall is designed to be scalable to the number of training instances as well as the number of training features.
• Classification
• Regression
• Recommendation
• k-nearest neighbor
• Anomaly Detection
• Feature Engineering
https://github.com/myui/hivemall
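One common way to make a classifier scale with both the number of training instances and the number of features is online learning over sparse features: each instance is processed once per pass, and weights are stored only for features that actually occur. The sketch below illustrates that idea with logistic regression trained by stochastic gradient descent. It is an assumption-laden illustration, not Hivemall's code or API; the feature names, the `train` and `predict` helpers, and the hyperparameters are all hypothetical.

```python
import math
from collections import defaultdict


def predict(weights, features):
    """Probability of the positive class for a list of binary feature names."""
    z = sum(weights[name] for name in features)
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=20, learning_rate=0.5):
    """Online logistic regression: one gradient step per training instance."""
    weights = defaultdict(float)  # sparse: only seen features get an entry
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, features) - label
            for name in features:
                weights[name] -= learning_rate * error
    return weights


# Hypothetical training data: did the session end in a purchase?
EXAMPLES = [
    (["event=purchase", "device=ios"], 1),
    (["event=purchase", "device=android"], 1),
    (["event=login", "device=ios"], 0),
    (["event=login", "device=android"], 0),
]

weights = train(EXAMPLES)
p_purchase = predict(weights, ["event=purchase", "device=ios"])
p_login = predict(weights, ["event=login", "device=ios"])
print(round(p_purchase, 3), round(p_login, 3))
```

Because each update touches only the features present in one instance, the cost per instance does not grow with the total number of features in the data set.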
The Hadoop Story on MongoDB
Image courtesy of Steven Francia @ Docker
Questions?
