Analyzing the power of Tweets in predicting Commodity Futures
Mar 17, 2014 
@gopivotal @being_bayesian 
Srivatsan Ramanujam 
Senior Data Scientist 
Pivotal 
© Copyright 2013 Pivotal. All rights reserved. 1
Problem Definition 
Ÿ Can we predict Corn, Soybean and Wheat futures based on Social Chatter on Twitter ? 
Ÿ The Customer: A major Agricultural Cooperative 
Data 
Obtaining Data 
Ÿ Used to fetch 5-years of historical tweets matching any of a list of keywords of interest 
Tweets Table Poster Information 
GNIP 
Ÿ As plugged-in partners, we’ve worked with 
GNIP before, experience was great! 
Ÿ We needed historical data and GNIP’s 
Historical PowerTrack came in handy 
Ÿ Clean API, quick quotes, convenient to 
download results of historical jobs 
Grain Futures Vs. Volume of Tweets 
The Platform 
Data Science Toolkit 
Ÿ Appliance 
– Full Rack DCA with Greenplum Database 
Ÿ ETL 
– Python 
Ÿ Modeling 
– SQL 
– MADlib 
– PL/Python, PL/Java 
– Ark-Tweet-NLP1 with PL/Java Wrappers 
Ÿ Visualization 
– Tableau 
1CMU ARK Twitter Parts-of-Speech tagger : http://www.ark.cs.cmu.edu/TweetNLP (GPL 2) 
Pivotal Greenplum MPP DB 
• Think of it as multiple PostgreSQL servers: a master and a set of segments (workers)
• Rows are distributed across segments by a particular field (or randomly)
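The distribution rule above can be illustrated with a small Python sketch. This is not Greenplum's own hashing code: the segment count, the use of `zlib.crc32` as the hash, and the sample rows with their column names are all assumptions made for illustration.

```python
import random
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # assumed cluster size, for illustration only


def segment_for_key(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment index via a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key=None, num_segments=NUM_SEGMENTS, seed=0):
    """Place rows on segments by hashing `key`, or randomly when key is None."""
    rng = random.Random(seed)
    segments = defaultdict(list)
    for row in rows:
        if key is None:
            seg = rng.randrange(num_segments)
        else:
            seg = segment_for_key(row[key], num_segments)
        segments[seg].append(row)
    return dict(segments)


# Hypothetical tweet rows; the column names are made up for this example.
tweets = [{"tweet_id": i, "poster": "user%d" % (i % 3)} for i in range(12)]
by_poster = distribute(tweets, key="poster")   # same poster -> same segment
at_random = distribute(tweets)                 # no key -> random placement
```

Hashing on a field keeps all rows that share a key value on one segment, while random placement spreads rows without regard to content.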
PL/X : X in {pgsql, R, Python, Java, Perl, C etc.} 
• Allows users to write Greenplum/PostgreSQL functions in R, Python, Java, Perl, pgsql or C
• The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database cluster
• Data parallelism:
- PL/X piggybacks on Greenplum’s MPP architecture
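The data-parallel idea can be sketched in plain Python: the same function runs against each segment's local rows, and the partial results are gathered afterwards. Threads stand in for segment processes here, and `count_keyword`, `run_on_segments` and the sample rows are hypothetical names and data, not part of Greenplum or PL/X.

```python
from concurrent.futures import ThreadPoolExecutor


def count_keyword(rows, keyword="corn"):
    """Per-segment work: count local tweets whose text mentions the keyword."""
    return sum(1 for text in rows if keyword in text.lower())


def run_on_segments(segments, func):
    """Run `func` on every segment's local rows concurrently, then gather."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        return list(pool.map(func, segments))


# Hypothetical rows already distributed across three segments.
segments = [
    ["Corn prices up again", "Rain in Iowa"],
    ["Wheat harvest delayed"],
    ["Planting corn this week", "CORN futures slide"],
]
partial_counts = run_on_segments(segments, count_keyword)
total = sum(partial_counts)  # the master combines the per-segment results
```

Each call only sees the rows of one segment, so the work scales with the number of segments rather than with the size of the whole table.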
[Figure: Greenplum architecture. SQL enters at the master host (with a standby master), which connects through the interconnect to several segment hosts, each running multiple segments.]
Scalable, in-database ML 
• Open Source! https://github.com/madlib/madlib
• Works on Greenplum DB and PostgreSQL
• Active development by Pivotal
- Latest release: 1.4 (Dec 2013)
• Downloads and docs: http://madlib.net/
MADlib In-Database Functions
Predictive Modeling Library
• Generalized Linear Models
– Linear Regression
– Logistic Regression
– Multinomial Logistic Regression
– Cox Proportional Hazards Regression
– Elastic Net Regularization
– Sandwich Estimators (Huber-White, clustered, marginal effects)
• Matrix Factorization
– Singular Value Decomposition (SVD)
– Low-Rank
• Machine Learning Algorithms
– Principal Component Analysis (PCA)
– Association Rules (Affinity Analysis, Market Basket)
– Topic Modeling (Parallel LDA)
– Decision Trees
– Ensemble Learners (Random Forests)
– Support Vector Machines
– Conditional Random Field (CRF)
– Clustering (K-means)
– Cross Validation
• Linear Systems
– Sparse and Dense Solvers
Descriptive Statistics
• Sketch-based Estimators
– CountMin (Cormode-Muthukrishnan)
– FM (Flajolet-Martin)
– MFV (Most Frequent Values)
• Correlation
• Summary
Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
The Models 
The Approach 
• In addition to identifying textual cues in tweets that were correlated with 
commodity futures, we also wanted to analyze whether tweet sentiment was 
correlated with commodity futures 
Sentiment Analysis – Challenges 
Ÿ Language on Twitter doesn’t 
adhere to rules of grammar, syntax 
or spelling 
Ÿ We don’t have labeled data for our 
problem. The tweets aren’t tagged 
with sentiment 
Ÿ Semi-Supervised Sentiment 
Prediction can be achieved by 
dictionary look-ups of tokens in a 
Tweet, but without Context, 
Sentiment Prediction is futile! 
@gopivotal @being_bayesian 
“Cool” 
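A minimal sketch of why context-free dictionary look-ups fall short, using the word “cool” shown on the slide. The polarity dictionary and both example tweets are invented for illustration; this is not the scoring used in the project.

```python
# Tiny hand-made polarity dictionary; the entries are assumptions for illustration.
POLARITY = {"cool": 1, "great": 1, "good": 1, "bad": -1, "poor": -1}


def lookup_sentiment(tweet):
    """Sum token polarities, ignoring context entirely."""
    tokens = tweet.lower().replace(",", " ").replace(".", " ").split()
    return sum(POLARITY.get(tok, 0) for tok in tokens)


# "cool" expresses approval here ...
approving = lookup_sentiment("Cool, corn yields look great")
# ... but describes the weather here, and still adds a positive point.
weather = lookup_sentiment("Cool temperatures will delay corn planting")
```

The second tweet is arguably bad news for planting, yet a token look-up scores it as positive because it cannot tell the two senses of the word apart.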
Sentiment Analysis – Approach 
Ÿ Parallelized ArkTweetNLP to achieve fast parts-of-speech tagging on Tweets 
Ÿ Custom (patent pending) algorithm to extract contextual cues & score sentiment of tweets 
Semi-Supervised Sentiment Classification 
Phrase Extraction 
Break-up Tweets into 
tokens and tag their 
parts-of-speech 
Part-of-speech 
tagger1 
1: Parts-of-speech Tagger : Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/) 
@gopivotal @being_bayesian 
Phrasal Polarity 
Scoring 
Use learned phrasal 
polarities to score 
sentiment of new tweets 
Sentiment Scored 
Tweets 
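The final scoring step can be pictured with a generic sketch. The custom algorithm from the talk is not described in the slides, so this is a stand-in technique (greedy longest-match look-up of phrases), and the phrase table, its values and the function name are hypothetical.

```python
# Hypothetical learned phrasal polarities; values are invented for illustration.
PHRASE_POLARITY = {
    ("not", "good"): -1.0,
    ("record", "yield"): 0.75,
    ("good",): 0.5,
    ("drought",): -0.75,
}
MAX_PHRASE_LEN = max(len(p) for p in PHRASE_POLARITY)


def score_tweet(tokens):
    """Greedy longest-match scoring of a token list against known phrases."""
    score, i = 0.0, 0
    while i < len(tokens):
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in PHRASE_POLARITY:
                score += PHRASE_POLARITY[phrase]
                i += n  # consume the whole matched phrase
                break
        else:
            i += 1  # no known phrase starts here; skip this token
    return score


negated = score_tweet("crop outlook not good".split())
positive = score_tweet("record yield looks good".split())
```

Matching the longest phrase first lets "not good" count as one negative unit instead of picking up the positive score of "good" on its own.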
Text Analytics Pipeline with GNIP stream 
[Figure: pipeline stages]
• Tweet stream
• Stored on HDFS
• Loaded as external tables into GPDB (gpfdist)
• Parallel parsing of JSON and extraction of fields using PL/Python
• Topic analysis through MADlib pLDA
• Sentiment analysis through custom PL/Python functions
• D3.js
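The JSON parsing step might look roughly like the function below, which could serve as the body of a PL/Python function applied to each raw row. The key names (`id`, `postedTime`, `body`, `actor.preferredUsername`) and the output field names are assumptions for illustration and may not match the actual feed format used in the project.

```python
import json


def extract_fields(raw_json):
    """Parse one tweet's JSON and return the fields kept for analysis.

    The key names are assumed for illustration. Returns None when the
    input is not a valid JSON object.
    """
    try:
        tweet = json.loads(raw_json)
    except (TypeError, ValueError):
        return None
    if not isinstance(tweet, dict):
        return None
    actor = tweet.get("actor")
    if not isinstance(actor, dict):
        actor = {}
    return {
        "tweet_id": tweet.get("id"),
        "posted_time": tweet.get("postedTime"),
        "body": tweet.get("body"),
        "poster": actor.get("preferredUsername"),
    }


# Hypothetical record, made up for this example.
sample = json.dumps({
    "id": "t-1",
    "postedTime": "2013-06-01T12:00:00Z",
    "body": "Corn planting is behind schedule",
    "actor": {"preferredUsername": "farmer_joe"},
})
row = extract_fields(sample)
```

Returning None for malformed input lets a bad record be filtered out instead of failing the whole load.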
Key Take-Aways 
• There is significant signal in tweets for predicting commodity futures
• Sentiment analysis of tweets can provide an additional signal in predicting commodity futures. Twitter sentiment was negatively correlated with commodity futures in the sample we analyzed
• A blended model of text regression, sentiment analysis and tweet actor information gave us encouraging results, and we believe that combining it with market fundamentals like weather or yield will give better models
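For readers who want to see what "negatively correlated" means in practice, here is a small sketch of the Pearson correlation coefficient. The numbers are synthetic and do not reproduce any result from the talk; the variable names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic numbers, invented purely to exercise the function.
daily_sentiment = [0.4, 0.1, -0.2, -0.5, 0.3]
futures_close = [430.0, 445.0, 452.0, 470.0, 436.0]
r = pearson(daily_sentiment, futures_close)  # negative for this toy data
```

A coefficient below zero means that days with higher sentiment tended to coincide with lower prices in the sample, which says nothing by itself about causation.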
What’s in it for me? 
Pivotal Open Source Contributions 
http://gopivotal.com/pivotal-products/open-source-software 
• MADlib – In-database parallel ML 
- https://github.com/madlib/madlib 
• PyMADlib – Python Wrapper for MADlib 
- https://github.com/gopivotal/pymadlib 
• PivotalR – R wrapper for MADlib 
- https://github.com/madlib-internal/PivotalR 
• Part-of-speech tagger for Twitter via SQL 
- http://vatsan.github.io/gp-ark-tweet-nlp/ 
Questions? 
@being_bayesian 
Top 5 Best Data Analytics Courses In Queens
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 

Analyzing Power of Tweets in Predicting Commodity Futures

  • 1. Analyzing the power of Tweets in predicting Commodity Futures Mar 17, 2014 @gopivotal @being_bayesian Srivatsan Ramanujam Senior Data Scientist Pivotal © Copyright 2013 Pivotal. All rights reserved. 1
  • 2. Problem Definition
      – Can we predict Corn, Soybean and Wheat futures based on social chatter on Twitter?
      – The customer: a major agricultural cooperative
  • 3. Data
  • 4. Obtaining Data
      – Used to fetch 5 years of historical tweets matching any of a list of keywords of interest
      [Figure: Tweets Table; Poster Information]
  • 5. GNIP
      – As plugged-in partners, we’ve worked with GNIP before, and the experience was great!
      – We needed historical data, and GNIP’s Historical PowerTrack came in handy
      – Clean API, quick quotes, and convenient download of results from historical jobs
  • 6. Grain Futures vs. Volume of Tweets
  • 7. The Platform
  • 8. Data Science Toolkit
      – Appliance: Full Rack DCA with Greenplum Database
      – ETL: Python
      – Modeling: SQL; MADlib; PL/Python, PL/Java; Ark-Tweet-NLP [1] with PL/Java wrappers
      – Visualization: Tableau
      [1] CMU ARK Twitter part-of-speech tagger: http://www.ark.cs.cmu.edu/TweetNLP (GPL 2)
  • 9. Pivotal Greenplum MPP DB
      – Think of it as multiple PostgreSQL servers: a master plus segments (workers)
      – Rows are distributed across segments by a particular field (or randomly)
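The row-distribution idea on slide 9 can be illustrated with a small sketch. This is not Greenplum's actual hashing scheme: the CRC32 hash, the `segment_for` and `distribute` helpers, and the sample rows are all assumptions chosen only to show how a distribution key decides which segment stores a row.

```python
import zlib
from collections import defaultdict


def segment_for(key: str, num_segments: int) -> int:
    """Map a distribution-key value to a segment index.

    CRC32 is a stand-in for whatever hash the database really uses;
    the only property that matters here is that it is deterministic.
    """
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments):
    """Group rows by the segment their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_field]), num_segments)].append(row)
    return dict(segments)


# Hypothetical rows, invented for illustration.
rows = [
    {"tweet_id": "1", "user": "alice", "body": "corn prices up"},
    {"tweet_id": "2", "user": "bob", "body": "wheat harvest delayed"},
    {"tweet_id": "3", "user": "alice", "body": "soybean exports strong"},
    {"tweet_id": "4", "user": "carol", "body": "dry weather in the corn belt"},
]

placement = distribute(rows, key_field="user", num_segments=4)
for segment, segment_rows in sorted(placement.items()):
    print(segment, [r["tweet_id"] for r in segment_rows])
```

Because the hash is deterministic, all rows sharing a key value land on the same segment, which is what lets per-key work run locally on each worker. Distributing randomly instead would spread rows more evenly but lose that co-location.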
  • 10. PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
      – Allows users to write Greenplum/PostgreSQL functions in R, Python, Java, Perl, pgsql or C
      – The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database cluster
      – Data parallelism: PL/X piggybacks on Greenplum’s MPP architecture
      [Diagram labels: SQL, Master Host, Standby, Interconnect, Segment Hosts each containing several Segments]
  • 11. Scalable, in-database ML
      – Open source: https://github.com/madlib/madlib
      – Works on Greenplum DB and PostgreSQL
      – Active development by Pivotal; latest release: 1.4 (Dec 2013)
      – Downloads and docs: http://madlib.net/
  • 12. MADlib In-Database Functions
      Predictive Modeling Library
      – Generalized Linear Models: Linear Regression; Logistic Regression; Multinomial Logistic Regression; Cox Proportional Hazards Regression; Elastic Net Regularization; Sandwich Estimators (Huber-White, clustered, marginal effects)
      – Matrix Factorization: Singular Value Decomposition (SVD); Low-Rank
      – Machine Learning Algorithms: Principal Component Analysis (PCA); Association Rules (Affinity Analysis, Market Basket); Topic Modeling (Parallel LDA); Decision Trees; Ensemble Learners (Random Forests); Support Vector Machines; Conditional Random Field (CRF); Clustering (K-means); Cross Validation
      – Linear Systems: Sparse and Dense Solvers
      Descriptive Statistics
      – Sketch-based Estimators: CountMin (Cormode-Muthukrishnan); FM (Flajolet-Martin); MFV (Most Frequent Values)
      – Correlation
      – Summary
      Support Modules
      – Array Operations; Sparse Vectors; Random Sampling; Probability Functions
  • 13. The Models
  • 14. The Approach
      – In addition to identifying textual cues in tweets that were correlated with commodity futures, we also wanted to analyze whether tweet sentiment was correlated with commodity futures
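One simple way to check whether two daily series move together, as slide 14 sets out to do, is a Pearson correlation coefficient. The sketch below is only an illustration of that measure: the talk does not say which statistic was used, and the two series are made-up numbers, not the talk's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Invented values for illustration only (NOT data from the talk):
# one mean sentiment score and one futures price change per day.
daily_sentiment = [0.4, 0.1, -0.2, -0.5, 0.3]
daily_price_change = [-1.0, -0.2, 0.5, 1.1, -0.7]

r = pearson(daily_sentiment, daily_price_change)
print(f"correlation: {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 indicates none; a real analysis would also need to consider time lags and statistical significance.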
  • 15. Sentiment Analysis – Challenges
      – Language on Twitter doesn’t adhere to rules of grammar, syntax or spelling
      – We don’t have labeled data for our problem: the tweets aren’t tagged with sentiment
      – Semi-supervised sentiment prediction can be achieved by dictionary look-ups of tokens in a tweet, but without context, sentiment prediction is futile!
      [Image caption: “Cool”]
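The limitation of dictionary look-ups named on slide 15 can be seen in a minimal sketch. The tiny `POLARITY` lexicon, the regex tokenizer, and the example sentences are assumptions made up for this illustration; they are not the lexicon or tokenizer used in the talk.

```python
import re

# Hypothetical word-level polarity lexicon (+1 positive, -1 negative).
POLARITY = {
    "cool": 1,
    "great": 1,
    "good": 1,
    "bad": -1,
    "poor": -1,
    "terrible": -1,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text, lexicon=POLARITY):
    """Sum the polarity of every token found in the lexicon."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


# "cool" expresses approval in the first sentence but describes the
# weather in the second; a context-free look-up scores both the same.
print(lexicon_score("That harvest report was cool"))
print(lexicon_score("Cool temperatures are delaying corn planting"))
```

Both sentences receive the same score because the look-up ignores the words around each token, which is the motivation for using part-of-speech tags and phrase-level context.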
  • 16. Sentiment Analysis – Approach
      – Parallelized ArkTweetNLP to achieve fast part-of-speech tagging on tweets
      – Custom (patent pending) algorithm to extract contextual cues and score sentiment of tweets
      [Diagram labels: Part-of-speech tagger [1] (break up tweets into tokens and tag their parts of speech); Phrase Extraction; Semi-Supervised Sentiment Classification; Phrasal Polarity Scoring (use learned phrasal polarities to score sentiment of new tweets); Sentiment Scored Tweets]
      [1] Part-of-speech tagger: Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
  • 17. Text Analytics Pipeline with GNIP stream
      – Tweet stream
      – Stored on HDFS (gpfdist)
      – Loaded as external tables into GPDB
      – Parallel parsing of JSON and extraction of fields using PL/Python
      – Topic analysis through MADlib pLDA
      – Sentiment analysis through custom PL/Python functions
      – D3.js
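The JSON parsing and field extraction step on slide 17 might look roughly like the sketch below. In the talk this logic runs as PL/Python inside the database; here it is plain Python so it can run anywhere. The field names (`id`, `postedTime`, `body`, `actor.preferredUsername`, `actor.followersCount`) are assumptions modeled on an activity-stream style payload, and `extract_fields` is a hypothetical helper, not a function from the talk.

```python
import json


def extract_fields(raw_line):
    """Parse one JSON-encoded tweet and return a flat dict of selected fields.

    Returns None for input that is not a JSON object, so one malformed
    record does not abort a whole batch.
    """
    try:
        activity = json.loads(raw_line)
    except (TypeError, ValueError):
        return None
    if not isinstance(activity, dict):
        return None

    actor = activity.get("actor")
    if not isinstance(actor, dict):
        actor = {}

    # Assumed field names; adjust to the payload actually received.
    return {
        "tweet_id": activity.get("id"),
        "posted_time": activity.get("postedTime"),
        "body": activity.get("body"),
        "user": actor.get("preferredUsername"),
        "followers": actor.get("followersCount"),
    }


# Hypothetical record, invented for illustration.
sample = json.dumps({
    "id": "example-1",
    "postedTime": "2013-06-01T12:00:00.000Z",
    "body": "Corn planting is behind schedule",
    "actor": {"preferredUsername": "farmer_jo", "followersCount": 120},
})

print(extract_fields(sample))
print(extract_fields("not json"))
```

Since each record is parsed independently, the same function can be applied to every row on every segment at once, which is what makes the step parallel.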
  • 18. Key Take-Aways
      – There is significant signal in tweets for predicting commodity futures
      – Sentiment analysis of tweets can provide an additional signal in predicting commodity futures. Twitter sentiment was negatively correlated with commodity futures in the sample we analyzed
      – A blended model of text regression, sentiment analysis and tweet actor information gave us encouraging results, and we believe that combining it with market fundamentals such as weather or yield will give better models
  • 19. What’s in it for me?
  • 20. Pivotal Open Source Contributions
      http://gopivotal.com/pivotal-products/open-source-software
      – MADlib, in-database parallel ML: https://github.com/madlib/madlib
      – PyMADlib, Python wrapper for MADlib: https://github.com/gopivotal/pymadlib
      – PivotalR, R wrapper for MADlib: https://github.com/madlib-internal/PivotalR
      – Part-of-speech tagger for Twitter via SQL: http://vatsan.github.io/gp-ark-tweet-nlp/
      Questions? @being_bayesian