Data Analytic Technology Platforms
Options and Tradeoffs
J Singh
January 7, 2014
Do you have a “Big Data” problem?
• Or do you have a big “data problem”?

© DataThinks 2013-14

Some Big Data problems (1)
• Recommendations

Some Big Data problems (2)
• Financial Analysis
– Really Big Data if we want Real Time analysis

Some Big Data problems (3)
• Internet Infrastructure Security Monitoring

Other Big Data problems
• Network graph problems (Social Media data)
• Bioinformatics problems (Genomics data)
• Physics/engineering problems (Sensor data)
•…

• Key characteristics
1. Not much in common between problems
2. Data too big to download or upload.
3. Data changes fast, requires near-real-time analysis.

Just Big “Data Problems” (JBDP)
• Most problems on Kaggle
• Popular data sets (e.g., Amazon, Kaggle, …)
– If it can be downloaded,
– If it doesn’t change very often, …
– It’s a JBDP
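The two tests on this slide can be written down as a small predicate. This is only a sketch of the rule as stated; the `DataSet` class, its field names, and the two example data sets are hypothetical and not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    """Hypothetical description of a data set, using the slide's two tests."""
    name: str
    can_be_downloaded: bool  # small enough to move to your own machine
    changes_often: bool      # would need near-real-time analysis to stay current


def is_jbdp(ds: DataSet) -> bool:
    """True when the data set is Just a Big "Data Problem" (JBDP)."""
    return ds.can_be_downloaded and not ds.changes_often


# Illustrative examples only (assumed, not taken from the slides).
competition_csv = DataSet("competition_csv", can_be_downloaded=True, changes_often=False)
live_network_flows = DataSet("live_network_flows", can_be_downloaded=False, changes_often=True)

if __name__ == "__main__":
    for ds in (competition_csv, live_network_flows):
        label = "JBDP" if is_jbdp(ds) else "Big Data problem"
        print(f"{ds.name}: {label}")
```

A data set that fails either test falls back into the "Big Data" category described on the previous slide.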

About us
• Technology and analytics service based on Big Data problems, focused on small & medium companies
• Analytics products
– App Kinetics – Application analytics for servicing users
– Pop Kinetics – Population analytics for targeting prospects

Background for this talk
• Experience building the “Kinetics” products
– Harvest the kinetic energy of your data for the benefit of your business
• Prior work
– Like-You: an application that trolls through Facebook data to find users who like the same things you do
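One common way to score "users who like the same things you do" is Jaccard similarity over sets of liked items. The slides do not say how Like-You computes its matches, so the sketch below is an assumed stand-in, and every name and data value in it is hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_like_you(you, likes, top_n=3):
    """Rank the other users by how much their likes overlap with yours.

    `likes` maps a user name to the set of things that user likes.
    """
    scores = [
        (other, jaccard(likes[you], pages))
        for other, pages in likes.items()
        if other != you
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up sample data for illustration only.
sample_likes = {
    "you": {"jazz", "hiking", "chess"},
    "ann": {"jazz", "hiking"},
    "bob": {"chess", "golf"},
    "cy": {"opera"},
}

if __name__ == "__main__":
    for user, score in most_like_you("you", sample_likes, top_n=2):
        print(f"{user}: {score:.2f}")
```

At the scale of a social network, the pairwise comparison would have to be distributed or approximated; this version only shows the scoring idea on a handful of users.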

Governing Principle for Platform Choices
• Big Data is difficult to move
– If you can move it easily, how big can it really be?

• Processing needs to be brought closer to the data
– Moving the data to processing is a losing proposition.

• Connector solutions for a database won’t scale
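A back-of-the-envelope calculation shows why moving the data is the losing side of the trade. The sketch below estimates transfer time from data size and link speed; the 100 Mbit/s link and the 80% efficiency factor are assumptions chosen for illustration, not figures from the talk.

```python
TB = 10**12    # bytes in a (decimal) terabyte
MBPS = 10**6   # bits per second in one Mbit/s


def transfer_hours(size_bytes, bandwidth_bits_per_sec, efficiency=0.8):
    """Hours needed to move `size_bytes` over a link.

    `efficiency` is the assumed fraction of nominal bandwidth actually achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = (size_bytes * 8) / (bandwidth_bits_per_sec * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    for size_tb in (0.01, 1, 100):
        hours = transfer_hours(size_tb * TB, 100 * MBPS)
        print(f"{size_tb:>6} TB over 100 Mbit/s: {hours:10.1f} hours")
```

Under these assumptions 1 TB takes about 28 hours, and 100 TB takes a hundred times longer, which is months. Running the analysis where the data already lives avoids paying that cost on every refresh.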

Implications of the Governing Principle
• Architecture has to be optimized across the entire pipeline
– Lesson learned:
  • The architecture is a giant jig-saw puzzle
  • Best of breed solutions may not fit!
  • Importance of caching in the pipeline
  • Vendor lock-in may be inevitable
– Cost, Data Volume and Bandwidth are primary drivers
• Different stacks for different applications
– App Kinetics: MongoDB-based stack
– Pop Kinetics: S3, Elastic Map Reduce-based stack
– Similarities: Google App Engine, Google Map Reduce

Governing Principle in Action

Function         App Kinetics       Pop Kinetics       Like-You
Data Collection  Custom “probes”    Facebook API       Facebook API
Data Storage     MongoDB            Amazon S3          Google Datastore
Analysis         Mongo M/R (JS),    Amazon EMR         Google App Engine M/R
                 PyMongo (Python)   (Hadoop, Python)   (Python)
Visualization    HTML+D3 (JS)       Text               HTML+JS

Recommendations: Text
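All three stacks run their analysis as some form of Map/Reduce. The sketch below shows the shape of such a job in plain Python, running in a single process. It does not use the MongoDB, Amazon EMR, or Google App Engine APIs, and the record layout and function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_likes(record):
    """Map step: emit (page, 1) for every page a user likes."""
    for page in record["likes"]:
        yield page, 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one page."""
    return key, sum(values)


def run_map_reduce(records, mapper, reducer):
    """Single-process stand-in for the grouping (shuffle) a real framework does."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())


# Made-up sample records for illustration only.
records = [
    {"user": "u1", "likes": ["jazz", "hiking"]},
    {"user": "u2", "likes": ["jazz", "chess"]},
    {"user": "u3", "likes": ["hiking", "jazz"]},
]
counts = run_map_reduce(records, map_likes, reduce_counts)

if __name__ == "__main__":
    for page, n in sorted(counts.items()):
        print(page, n)
```

In a real deployment the mapper and reducer run next to the stored data (inside MongoDB, on an EMR cluster reading S3, or on App Engine reading Datastore), which is the governing principle applied: the code moves, the data stays.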

The decision-making process
• An iterative process (like solving a jig-saw puzzle)
– Not linear or formulaic
• What is the objective?
– Discovery?
  • If there is a market?
  • If the concept is feasible?
– Time to market?
– Hitting a cost target?
– A scalable solution?
– Minimizing lock-in?
• About the data
– Volume
  • Rate of Growth
– Velocity
– Variety
– Format
– Location, location, …

Thank you
• J Singh
– Principal, DataThinks
  • j.singh@datathinks.org
– Adj. Prof, WPI


