#CASSANDRA13
Ken Krugler | President, Scale Unlimited
Suicide Prevention Using Social Media and Cassandra
#CASSANDRA13
What we will discuss today...
*Using Cassandra to store social media content
*Combining Hadoop workflows with Cassandra
*Leveraging Solr search support in DataStax Enterprise
*Doing good with big data
Fine Print!
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA)
and Space and Naval Warfare Systems Center Pacific under Contract N66001-11-4006. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA) and Space and
Naval Warfare Systems Center Pacific.
#CASSANDRA13
Obligatory Background
*Ken Krugler, Scale Unlimited - Nevada City, CA
*Consulting on big data workflows, machine learning & search
*Training for Hadoop, Cascading, Solr & Cassandra
#CASSANDRA13
Durkheim Project Overview
Including things we didn't work on...
#CASSANDRA13
What's the problem?
*More soldiers die from suicide than combat
*Suicide rate has gone up 80% since 2002
*Civilian suicide rates are also climbing
*More suicides than homicides
*Intervention after an "event" is often too late
Graph of suicide rates
#CASSANDRA13
What is The Durkheim Project?
*DARPA-funded initiative to help military physicians
*Uses predictive analytics to estimate suicide risk from what people write online
*Each user is assigned a suicidality risk rating of red, yellow or green
Émile Durkheim
Named after Émile Durkheim, the late-1800s sociologist who first used text analytics to help define suicide risk.
#CASSANDRA13
Current Status of Durkheim
*Collaborative effort involving Patterns and Predictions, Dartmouth Medical School & Facebook
*Details at http://www.durkheimproject.org/
*Finished phase I, now being rolled out to wider audience
Patterns and Predictions has a background in predicting financial market events and trends from news, which led to the development of the predictive models used in Durkheim.
#CASSANDRA13
Predictive Analytics
*Guessing at state of mind from text
-"There are very few people in this world that know the REAL me."
-"I lay down to go to sleep, but all I can do is cry"
*Uses labeled training data from clinical notes
*Phase I results promising, for small sample set
-"ensemble" of predictors is a powerful ML technique
#CASSANDRA13
Clinician Dashboard
*Multiple views on patient
*Prediction & confidence
*Backing data (key phrases, etc)
So this is the goal - give medical staff indications of who they should be most concerned about.
#CASSANDRA13
Data Collection
Where _do_ you put a billion text snippets?
The previous section was the project overview, which covered work done by others in the project.
Now we get to the part that we worked on, which involves Cassandra.
#CASSANDRA13
Saving Social Media Activity
*System to continuously save new activity
-Scalable data store
*Also needs a scalable, reliable way to access data
-Processed in bulk (workflows)
-Accessed at individual level
-Searched at activity level
For the current size of the project, MySQL would be just fine.
But we want an architecture that can scale if/when the project is rolled out to everyone
#CASSANDRA13
Data Collection
*Pink is what we wrote
*Green is in Cassandra
*Key data path in red
Diagram components: Exciting Social Media Activity, Gigya Service, Gigya Daemon, Durkheim App, Durkheim Social API, Users Table, Activity Table
#CASSANDRA13
Designing the Column Families
*What queries do we need to handle?
-Always by user id (what we assign)
*We want all the data for a user
-Both for Users table, and Activities table
-Sometimes we want a date range of activities
*So one row per user
-And ordered by date in the Activities table
#CASSANDRA13
Users Table (Column Family)
*One row per user - row key is a UUID we assign
*Standard "static" columns
-First name, last name, opt_in status, etc.
*Easy to add more xxx_id columns for new services
row key | first_name | last_name | facebook_id | twitter_id | opt_in
#CASSANDRA13
Activities Table (Column Family)
*One row per user - row key is a UUID we assign
*One composite column per social media event
-Timestamp (long value)
-Source (FB, TW, GP, etc)
-Type of column (data, activity id, user id, type of activity)
row key | ts_src_data | ts_src_id | ts_src_providerUid | ts_src_type
Remember we wanted to get slices of data by date?
So we use timestamp as the first (primary) ordering for the columns.
We can use a regular millisecond timestamp because each row holds one user's activity, and we assume a single user won't produce multiple entries in the same millisecond.
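The date-range slice can be sketched without a Cassandra client at all. This is a minimal in-memory model, not the project's Astyanax code: it assumes column names are (timestamp, source, field) tuples kept in sorted order, and the helper names `add_event` and `slice_by_time` are made up for illustration.

```python
from bisect import bisect_left

# One wide row per user. Column names are (timestamp_ms, source, field) tuples,
# so sorting the names orders the columns by time first.
def add_event(row, ts, source, data, activity_id, provider_uid, activity_type):
    """Store one social media event as four composite columns (hypothetical helper)."""
    for field, value in (("data", data), ("id", activity_id),
                         ("providerUid", provider_uid), ("type", activity_type)):
        row[(ts, source, field)] = value

def slice_by_time(row, start_ts, end_ts):
    """Return the columns with start_ts <= timestamp <= end_ts, in column order."""
    names = sorted(row)
    lo = bisect_left(names, (start_ts,))    # first column at or after start_ts
    hi = bisect_left(names, (end_ts + 1,))  # first column after end_ts
    return [(name, row[name]) for name in names[lo:hi]]

row = {}
add_event(row, 213, "FB", "I feel tired", "FB post #32", "FB user #66", "Status update")
add_event(row, 307, "TW", "Where am I?", "Tweet #17", "TW user #109", "Tweet")
recent = slice_by_time(row, 300, 400)  # only the four columns of the 307 tweet
```

Because the timestamp is the leading component, a date range is always one contiguous run of columns, which is what makes the slice cheap.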
#CASSANDRA13
Two Views of Composite Columns
*As a row/column view
"uuid1" | 213_FB_data | 213_FB_id | 213_FB_providerUid | 213_FB_type
        | "I feel tired" | "FB post #32" | "FB user #66" | "Status update"
*As a key-value map
"uuid1":
213_FB_data = "I feel tired"
213_FB_id = "FB post #32"
213_FB_providerUid = "FB user #66"
213_FB_type = "Status update"
#CASSANDRA13
Implementation Details
*API access protected via signature
*Gigya Daemon on both t1.micro servers
-But only active on one of them
*Astyanax client talks to Cassandra
*Cluster uses 3 m1.large servers
Diagram components: Durkheim App, AWS Load Balancer, two Durkheim Social API instances, EC2 t1.micro servers, EC2 m1.large servers
#CASSANDRA13
Predictive Analytics at Scale
Running workflows against Cassandra data
#CASSANDRA13
How to process all this social media goodness?
*Models are defined elsewhere
*These are "black boxes" to us
213_FB_data | 213_FB_id | 213_FB_providerUid | 213_FB_type
"I feel tired" | "FB post #32" | "FB user #66" | "Status update"
307_TW_data | 307_TW_id | 307_TW_providerUid | 307_TW_type
"Where am I?" | "Tweet #17" | "TW user #109" | "Tweet"
Diagram: activity columns -> Feature Extraction -> Model -> output columns (model | rating | probability | keywords)
Models are data used by PA engine to generate scores
We do not have or want access to the data used to generate the models
Generating the model is often NOT something that needs scalability
Amount of labeled data is typically pretty small.
Training often works best on a single server.
#CASSANDRA13
Why do we need Hadoop?
*Running one model on one user is easy
-And n models on one user is still OK
*But when a model changes...
-all users with the model need processing
Models can change frequently
And when a user changes...
- that user with all models needs processing
- adding/removing models is also a change
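The bookkeeping behind those two cases can be sketched as follows. This is only an illustration of the reasoning in these notes: the `assignments` mapping and the function name `pairs_to_process` are assumptions, not part of the actual system.

```python
def pairs_to_process(assignments, changed_models=(), changed_users=()):
    """Return the (user, model) pairs that need re-scoring.

    assignments maps a user id to the set of model names applied to that user.
    A changed model affects every user that has it; a changed user needs all
    of that user's models run again.
    """
    changed_models = set(changed_models)
    changed_users = set(changed_users)
    pairs = set()
    for user, models in assignments.items():
        for model in models:
            if user in changed_users or model in changed_models:
                pairs.add((user, model))
    return pairs

assignments = {"uuid1": {"m1", "m2"}, "uuid2": {"m1"}, "uuid3": {"m2"}}
after_model_change = pairs_to_process(assignments, changed_models=["m1"])
after_user_change = pairs_to_process(assignments, changed_users=["uuid1"])
```

A model change fans out across users, which is the case that grows with the size of the user base and motivates a batch system.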
#CASSANDRA13
Batch processing is OK
*No strict low-latency requirements
*So we use Hadoop, for scalability and reliability
#CASSANDRA13
Hadoop Workflow Details
*Implemented using Cascading
*Read Activities Table using Cassandra Tap
*Read models from MySQL via JDBC
#CASSANDRA13
Hadoop Bulk Classification Workflow
Read Social Media Activity Table -> Convert from Cassandra
Read User Profiles Table -> Convert from Cassandra
CoGroup by user profile ID
Run Classifier models
Convert to Cassandra
Write Classification Result Table
Separate from this, we've loaded the models into memory and serialized them with the classification step
This is all done using Cascading to define the workflow.
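For readers who haven't used Cascading, the shape of the workflow can be sketched in plain Python. This is not the Cascading code: the function names are invented, the inputs are in-memory lists rather than Cassandra taps, and `activity_count_model` is a deliberately trivial stand-in for the real "black box" models.

```python
from collections import defaultdict

def cogroup(profiles, activities):
    """Group both inputs by user id, like a CoGroup on the user profile ID."""
    grouped = defaultdict(lambda: {"profile": None, "activities": []})
    for user_id, profile in profiles:
        grouped[user_id]["profile"] = profile
    for user_id, text in activities:
        grouped[user_id]["activities"].append(text)
    return grouped

def run_workflow(profiles, activities, models):
    """Run every model on every known user and collect (user, model, result) rows."""
    results = []
    for user_id, group in sorted(cogroup(profiles, activities).items()):
        if group["profile"] is None:
            continue  # activity for an unknown user; skipped in this sketch
        for name, model in sorted(models.items()):
            results.append((user_id, name, model(group["profile"], group["activities"])))
    return results

def activity_count_model(profile, texts):
    """Placeholder model: only counts the activities it was given."""
    return {"activities_seen": len(texts)}

profiles = [("uuid1", {"opt_in": True}), ("uuid2", {"opt_in": True})]
activities = [("uuid1", "I feel tired"), ("uuid1", "Where am I?")]
results = run_workflow(profiles, activities, {"count": activity_count_model})
```

In the real workflow the grouping is done by Hadoop across the cluster, and the result rows are converted and written back to a Cassandra table.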
#CASSANDRA13
Workflow Issues
*Currently manual operation
-Ultimately needs a daemon to trigger (time, users, models)
*Runs in separate cluster
-Lots of network activity to pull data from Cassandra cluster
-With DSE we could run on same cluster
*Fun with AWS security groups
#CASSANDRA13
Solr Search
Poking at the data
#CASSANDRA13
Solr Search
*Model results include key terms for classification result
-"feel angry" (0.732)
*Now you want to check actual usage of these terms
Maybe actual text was "I don't feel angry when my wifi connection drops".
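A toy version of that check, using a substring match as a stand-in for a Solr query. The `search_rows` mapping, its keys and the function name `find_usages` are made-up sample data and names for illustration only.

```python
def find_usages(search_rows, phrase):
    """Return (key, text) pairs whose text contains the phrase, ignoring case."""
    needle = phrase.lower()
    return [(key, row["data"]) for key, row in sorted(search_rows.items())
            if needle in row["data"].lower()]

# Hypothetical sample rows: key -> {"data": text}
search_rows = {
    "uuid1_213_FB": {"data": "I feel tired"},
    "uuid2_415_TW": {"data": "I don't feel angry when my wifi connection drops"},
}
hits = find_usages(search_rows, "feel angry")
```

The hit shows the phrase in its full sentence, which is what lets a person judge whether the key term means what the model's score suggests.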
#CASSANDRA13
Poking at the Data
*Hadoop turns petabytes into pie-charts
*How do you verify results?
*Search works really well here
Maybe before you'd use a spreadsheet printout to argue.
But that would be Satan's Spreadsheet with billions of rows.
#CASSANDRA13
Solr Search
*Want "narrow" table for search
-Solr dynamic fields are usually not a great idea
-Limit to 1024 dynamic fields per document
*So we'll replicate some of our Activity CF data into a new CF
*Don't be afraid of making copies of data
#CASSANDRA13
The "Search" Column Family
*Row key is derived from Activity CF UUID + target column name
*One column ("data") has content from that row + column in Activity CF
Activity Column Family
"uuid1" | 213_FB_data | 213_FB_id
        | "I feel tired" | "FB post #32"
Search Column Family
row key | "data"
"uuid1_213_FB" | "I feel tired"
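Deriving the narrow search rows from a wide activity row might look like the sketch below. It assumes the flattened `ts_src_field` column names shown on the slides and a row key of the form `uuid_ts_src`; the function name `to_search_rows` is hypothetical.

```python
def to_search_rows(user_id, activity_row, field="data"):
    """Copy each '<ts>_<src>_<field>' column into its own one-column search row."""
    suffix = "_" + field
    return {
        f"{user_id}_{name[:-len(suffix)]}": {"data": value}
        for name, value in activity_row.items()
        if name.endswith(suffix)
    }

activity_row = {"213_FB_data": "I feel tired", "213_FB_id": "FB post #32"}
search_rows = to_search_rows("uuid1", activity_row)
```

Each event ends up as its own small document, so the Solr schema needs only a fixed pair of fields instead of one dynamic field per event.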
#CASSANDRA13
Solr Schema
*Very simple (which is how we like it)
*Direct one-to-one mapping with Cassandra columns
*Hits have key field, which contains UUID/Timestamp/Service
<fields>
<field name="key" type="string" indexed="true" stored="true" />
<field name="data" type="text" indexed="true" stored="true" />
</fields>
So once we have a hit, we can access information in activity table if needed.
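Going from a hit back to the activity data can be sketched like this. The key layout `uuid_timestamp_service` is inferred from the slide's example, and the in-memory `activity_table` and both function names are assumptions for illustration.

```python
def parse_hit_key(key):
    """Split a search key into (user id, timestamp, service)."""
    user_id, ts, source = key.rsplit("_", 2)
    return user_id, int(ts), source

def columns_for_hit(activity_table, key):
    """Return all activity columns for the event a search hit points at."""
    user_id, ts, source = parse_hit_key(key)
    prefix = f"{ts}_{source}_"
    row = activity_table.get(user_id, {})
    return {name[len(prefix):]: value
            for name, value in row.items() if name.startswith(prefix)}

activity_table = {
    "uuid1": {
        "213_FB_data": "I feel tired",
        "213_FB_id": "FB post #32",
        "213_FB_providerUid": "FB user #66",
        "213_FB_type": "Status update",
        "307_TW_data": "Where am I?",
    }
}
event = columns_for_hit(activity_table, "uuid1_213_FB")
```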
#CASSANDRA13
Combined Cluster
*One Cassandra Cluster can allocate nodes for Hadoop & Search
#CASSANDRA13
Security
Locking things down
#CASSANDRA13
The Most Important Detail
*We don't have any personal medical data!!!
*We don't have any personal medical data!!!
*We don't have any personal medical data!!!
As soon as you've got personal medical data, it's a whole new ballgame.
At least an order of magnitude more work to make it really secure.
Likely that you couldn't use AWS cloud
We still care about security, because we're collecting social media activity that isn't necessarily public.
#CASSANDRA13
Three Aspects of Security
*Server-level
-ssh via restricted private key
*API-level
-validate requests using signature
-secure SHA1 hash
*Services-level
-Restrict open ports using security groups
So even if you knew which server was running OpsCenter, you couldn't just start poking around.
Access to Cassandra is only via t1.micro servers, which are in same security group
t1.micro servers only open up ssh and port needed for external API request
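The slides only say that API requests are validated with a signature and a SHA1 hash. One common way to do that is an HMAC over the request, sketched here with Python's standard library. The choice of HMAC-SHA1, the message layout, the example path and the function names are all assumptions, not the project's actual scheme.

```python
import hashlib
import hmac

def sign(secret: bytes, method: str, path: str, body: bytes) -> str:
    """Compute a hex HMAC-SHA1 signature over the request (assumed layout)."""
    message = b"\n".join([method.encode(), path.encode(), body])
    return hmac.new(secret, message, hashlib.sha1).hexdigest()

def is_valid(secret: bytes, method: str, path: str, body: bytes, signature: str) -> bool:
    """Check a client-supplied signature using a constant-time comparison."""
    return hmac.compare_digest(sign(secret, method, path, body), signature)

secret = b"shared-secret"  # hypothetical key shared by client and API
signature = sign(secret, "POST", "/users", b'{"opt_in": true}')
accepted = is_valid(secret, "POST", "/users", b'{"opt_in": true}', signature)
rejected = is_valid(secret, "POST", "/users", b'{"opt_in": false}', signature)
```

Any change to the method, path or body changes the signature, so a request altered in transit or sent without the shared secret is refused.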
#CASSANDRA13
Summary
Bringing it all home
#CASSANDRA13
Key Points...
*You can effectively use Cassandra as:
-A repository for social media data
-The data source for workflows
-A search index, via Solr integration
#CASSANDRA13
And the Meta-Point
*It is possible to do more with big data than optimize ad yields
#CASSANDRA13
THANK YOU

What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Suicide Risk Prediction Using Social Media and Cassandra

  • 1. #CASSANDRA13 Ken Krugler | President, Scale Unlimited. Suicide Prevention Using Social Media and Cassandra
  • 2. #CASSANDRA13 What we will discuss today... *Using Cassandra to store social media content *Combining Hadoop workflows with Cassandra *Leveraging Solr search support in DataStax Enterprise *Doing good with big data Fine Print! This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and Space and Naval Warfare Systems Center Pacific under Contract N66001-11-4006. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA) and Space and Naval Warfare Systems Center Pacific.
  • 3. #CASSANDRA13 Obligatory Background *Ken Krugler, Scale Unlimited - Nevada City, CA *Consulting on big data workflows, machine learning & search *Training for Hadoop, Cascading, Solr & Cassandra
  • 5. #CASSANDRA13 What's the problem? *More soldiers die from suicide than combat *Suicide rate has gone up 80% since 2002 *Civilian suicide rates are also climbing *More suicides than homicides *Intervention after an "event" is often too late Graph of suicide rates
  • 6. #CASSANDRA13 What is The Durkheim Project? *DARPA-funded initiative to help military physicians *Uses predictive analytics to estimate suicide risk from what people write online *Each user is assigned a suicidality risk rating of red, yellow or green. Named after Émile Durkheim, the late-1800s sociologist who first used text analytics to help define suicide risk.
  • 7. #CASSANDRA13 Current Status of Durkheim *Collaborative effort involving Patterns and Predictions, Dartmouth Medical School & Facebook *Details at http://www.durkheimproject.org/ *Finished phase I, now being rolled out to wider audience Patterns and Predictions has background expertise in predicting financial market events and trends from news, which led to the development of the predictive models used in Durkheim.
  • 8. #CASSANDRA13 Predictive Analytics *Guessing at state of mind from text -"There are very few people in this world that know the REAL me." -"I lay down to go to sleep, but all I can do is cry" *Uses labeled training data from clinical notes *Phase I results promising, for small sample set -"ensemble" of predictors is a powerful ML technique
  • 9. #CASSANDRA13 Clinician Dashboard *Multiple views on patient *Prediction & confidence *Backing data (key phrases, etc.) So this is the goal: give medical staff indications of who they should be most concerned about.
  • 10. #CASSANDRA13 Data Collection Where _do_ you put a billion text snippets? The previous section was the project overview, which was work done by others in the project. Now we get to the part that we worked on, which involves Cassandra.
  • 11. #CASSANDRA13 Saving Social Media Activity *System to continuously save new activity -Scalable data store *Also needs a scalable, reliable way to access data -Processed in bulk (workflows) -Accessed at individual level -Searched at activity level For the current size of the project, MySQL would be just fine. But we want an architecture that can scale if/when the project is rolled out to everyone.
  • 12. #CASSANDRA13 Data Collection *Pink is what we wrote *Green is in Cassandra *Key data path in red [Diagram components: Exciting Social Media Activity, Gigya Service, Gigya Daemon, Durkheim Social API, Durkheim App, Users Table, Activity Table]
  • 13. #CASSANDRA13 Designing the Column Families *What queries do we need to handle? -Always by user id (what we assign) *We want all the data for a user -Both for Users table, and Activities table -Sometimes we want a date range of activities *So one row per user -And ordered by date in the Activities table
  • 14. #CASSANDRA13 Users Table (Column Family) *One row per user - row key is a UUID we assign *Standard "static" columns -First name, last name, opt_in status, etc. *Easy to add more xxx_id columns for new services [Row layout: row key | first_name | last_name | facebook_id | twitter_id | opt_in]
  • 15. #CASSANDRA13 Activities Table (Column Family) *One row per user - row key is a UUID we assign *One composite column per social media event -Timestamp (long value) -Source (FB, TW, GP, etc.) -Type of column (data, activity id, user id, type of activity) [Row layout: row key | ts_src_data | ts_src_id | ts_src_providerUid | ts_src_type] Remember we wanted to get slices of data by date? So we use the timestamp as the first (primary) ordering for the columns. We can use a regular millisecond timestamp, since it is for one user and we assume we don't get multiple entries with the same timestamp.
  • 16. #CASSANDRA13 Two Views of Composite Columns *As a row/column view: row key "uuid1" with columns 213_FB_data = "I feel tired", 213_FB_id = "FB post #32", 213_FB_providerUid = "FB user #66", 213_FB_type = "Status update" *As a key-value map: "uuid1" -> {213_FB_data: "I feel tired", 213_FB_id: "FB post #32", 213_FB_providerUid: "FB user #66", 213_FB_type: "Status update"}
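The composite-column layout can be sketched in plain Python. This is a minimal in-memory stand-in, not Cassandra or Astyanax code: the names `make_row` and `slice_by_time` are hypothetical, and the point is only to show why putting the timestamp first in the column name makes date-range slices cheap.

```python
from bisect import bisect_left

def make_row(events):
    """Flatten events into composite columns sorted by (timestamp, source, kind).

    Hypothetical stand-in for one Activities row; a real row lives in Cassandra.
    """
    columns = []
    for ts, source, fields in events:
        for kind, value in fields.items():
            columns.append(((ts, source, kind), value))
    columns.sort(key=lambda col: col[0])
    return columns

def slice_by_time(columns, start_ts, end_ts):
    """Return columns whose integer timestamp lies in [start_ts, end_ts]."""
    names = [name for name, _ in columns]
    # A one-element tuple sorts before every full column name sharing that timestamp.
    lo = bisect_left(names, (start_ts,))
    hi = bisect_left(names, (end_ts + 1,))
    return columns[lo:hi]

events = [
    (213, "FB", {"data": "I feel tired", "id": "FB post #32",
                 "providerUid": "FB user #66", "type": "Status update"}),
    (307, "TW", {"data": "Where am I?", "id": "Tweet #17",
                 "providerUid": "TW user #109", "type": "Tweet"}),
]
row = make_row(events)
recent = slice_by_time(row, 300, 400)  # only the TW columns fall in this range
```

Because the columns are kept sorted by timestamp first, a date range is a contiguous run of columns, which is the property the slides rely on.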
  • 17. #CASSANDRA13 Implementation Details *API access protected via signature *Gigya Daemon on both t1.micro servers -But only active on one of them *Astyanax client talks to Cassandra *Cluster uses 3 m1.large servers [Diagram components: Durkheim App, AWS Load Balancer, Durkheim Social API (on each of the EC2 t1.micro servers), EC2 m1.large servers]
  • 18. #CASSANDRA13 Predictive Analytics at Scale Running workflows against Cassandra data
  • 19. #CASSANDRA13 How to process all this social media goodness? *Models are defined elsewhere *These are "black boxes" to us [Diagram: activity columns (213_FB_data = "I feel tired", 213_FB_id = "FB post #32", 213_FB_providerUid = "FB user #66", 213_FB_type = "Status update"; 307_TW_data = "Where am I?", 307_TW_id = "Tweet #17", 307_TW_providerUid = "TW user #109", 307_TW_type = "Tweet") -> Feature Extraction -> Model -> model, rating, probability, keywords] Models are data used by the PA engine to generate scores. We do not have or want access to the data used to generate the models. Generating the model is often NOT something that needs scalability: the amount of labeled data is typically pretty small, and training often works best on a single server.
  • 20. #CASSANDRA13 Why do we need Hadoop? *Running one model on one user is easy -And n models on one user is still OK *But when a model changes... -all users with the model need processing. Models can change frequently. *And when a user changes... -that user with all models needs processing -adding/removing models is also a change
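The reprocessing rule on this slide can be written down as a small set computation. This is only a sketch under assumptions: `pairs_to_process` and the `user_models` mapping are hypothetical names, not part of the project's code.

```python
def pairs_to_process(user_models, changed_models=(), changed_users=()):
    """Return the (user, model) pairs that need re-scoring.

    user_models maps a user id to the set of model names assigned to that user
    (a hypothetical structure used only for this illustration).
    """
    changed_models = set(changed_models)
    changed_users = set(changed_users)
    work = set()
    for user, models in user_models.items():
        for model in models:
            # A changed model touches every user that has it;
            # a changed user touches every model assigned to that user.
            if user in changed_users or model in changed_models:
                work.add((user, model))
    return work

user_models = {"u1": {"m1", "m2"}, "u2": {"m1"}, "u3": {"m2"}}
after_model_change = pairs_to_process(user_models, changed_models=["m1"])
after_user_change = pairs_to_process(user_models, changed_users=["u3"])
```

A single model change fans out to every user holding that model, which is why the work is batched rather than done one request at a time.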
  • 21. #CASSANDRA13 Batch processing is OK *No strict minimum latency requirements *So we use Hadoop, for scalability and reliability
  • 22. #CASSANDRA13 Hadoop Workflow Details *Implemented using Cascading *Read Activities Table using Cassandra Tap *Read models from MySQL via JDBC
  • 23. #CASSANDRA13 Hadoop Bulk Classification Workflow [Diagram: Read User Profiles Table -> Convert from Cassandra; Read Social Media Activity Table -> Convert from Cassandra; both feed CoGroup by user profile ID -> Run Classifier models -> Convert to Cassandra -> Write Classification Result Table] Separate from this, we've loaded the models into memory and serialized them with the classification step. This is all done using Cascading to define the workflow.
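The shape of this workflow can be sketched in plain Python. This is not Cascading code and does not touch Cassandra; `cogroup_by_user`, `classify_all` and `toy_model` are hypothetical names, and the toy model's ratings and probabilities are made-up illustrative values standing in for the real black-box models.

```python
from collections import defaultdict

def cogroup_by_user(profiles, activities):
    """Join profile and activity records on user id (the CoGroup step)."""
    grouped = defaultdict(lambda: {"profile": None, "activities": []})
    for user_id, profile in profiles:
        grouped[user_id]["profile"] = profile
    for user_id, text in activities:
        grouped[user_id]["activities"].append(text)
    return dict(grouped)

def classify_all(grouped, models):
    """Run every model on every user; each model is a black-box callable."""
    results = []
    for user_id in sorted(grouped):
        texts = grouped[user_id]["activities"]
        for name, model in sorted(models.items()):
            rating, probability, keywords = model(texts)
            results.append({"user_id": user_id, "model": name, "rating": rating,
                            "probability": probability, "keywords": keywords})
    return results

def toy_model(texts):
    """Hypothetical stand-in for a real model: flags the word 'tired'."""
    hits = sum("tired" in t.lower() for t in texts)
    if hits:
        return "yellow", 0.6, ["tired"]
    return "green", 0.9, []

profiles = [("uuid1", {"first_name": "A"}), ("uuid2", {"first_name": "B"})]
activities = [("uuid1", "I feel tired"), ("uuid2", "Where am I?")]
results = classify_all(cogroup_by_user(profiles, activities), {"toy": toy_model})
```

Each result row carries the model name, rating, probability and keywords, matching the model outputs listed earlier in the talk.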
  • 24. #CASSANDRA13 Workflow Issues *Currently manual operation -Ultimately needs a daemon to trigger (time, users, models) *Runs in separate cluster -Lots of network activity to pull data from Cassandra cluster -With DSE we could run on same cluster *Fun with AWS security groups
  • 26. #CASSANDRA13 Solr Search *Model results include key terms for classification result -"feel angry" (0.732) *Now you want to check actual usage of these terms Maybe actual text was "I don't feel angry when my wifi connection drops".
  • 27. #CASSANDRA13 Poking at the Data *Hadoop turns petabytes into pie-charts *How do you verify results? *Search works really well here Maybe before you'd use a spreadsheet printout to argue. But that would be Satan's Spreadsheet with billions of rows.
  • 28. #CASSANDRA13 Solr Search *Want "narrow" table for search -Solr dynamic fields are usually not a great idea -Limit to 1024 dynamic fields per document *So we'll replicate some of our Activity CF data into a new CF *Don't be afraid of making copies of data
  • 29. #CASSANDRA13 The "Search" Column Family *Row key is derived from Activity CF UUID + target column name *One column ("data") has content from that row + column in Activity CF [Diagram: Activity Column Family row "uuid1" with columns 213_FB_data = "I feel tired", 213_FB_id = "FB post #32" -> Search Column Family row key "uuid1_213_FB" with "data" = "I feel tired"]
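Deriving the narrow Search rows from one Activities row can be sketched as follows. The key format `<uuid>_<timestamp>_<source>` is an assumption read off the slide's "uuid1_213_FB" example, and `search_rows` and `parse_search_key` are hypothetical helper names.

```python
def search_rows(user_uuid, activity_columns):
    """Copy only the text ("data") columns of one Activities row into
    narrow Search rows keyed by '<uuid>_<timestamp>_<source>'."""
    rows = {}
    for column_name, value in activity_columns.items():
        ts, source, kind = column_name.split("_", 2)
        if kind == "data":
            rows[f"{user_uuid}_{ts}_{source}"] = {"data": value}
    return rows

def parse_search_key(key):
    """Recover (uuid, timestamp, source) from a search hit's key field."""
    # Split from the right so a UUID containing underscores would still parse.
    user_uuid, ts, source = key.rsplit("_", 2)
    return user_uuid, int(ts), source

activity_row = {"213_FB_data": "I feel tired", "213_FB_id": "FB post #32"}
rows = search_rows("uuid1", activity_row)
hit = parse_search_key("uuid1_213_FB")
```

Parsing the key back into its parts is what lets a search hit lead back to the matching columns in the Activities row.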
  • 30. #CASSANDRA13 Solr Schema *Very simple (which is how we like it) *Direct one-to-one mapping with Cassandra columns *Hits have key field, which contains UUID/Timestamp/Service <fields> <field name="key" type="string" indexed="true" stored="true" /> <field name="data" type="text" indexed="true" stored="true" /> </fields> So once we have a hit, we can access information in activity table if needed.
  • 31. #CASSANDRA13 Combined Cluster *One Cassandra Cluster can allocate nodes for Hadoop & Search
  • 33. #CASSANDRA13 The Most Important Detail *We don't have any personal medical data!!! *We don't have any personal medical data!!! *We don't have any personal medical data!!! As soon as you've got personal medical data, it's a whole new ballgame. At least an order of magnitude more work to make it really secure. It's likely that you couldn't use the AWS cloud. We still care about security, because we're collecting social media activity that isn't necessarily public.
  • 34. #CASSANDRA13 Three Aspects of Security *Server-level -ssh via restricted private key *API-level -validate requests using signature -secure SHA1 hash *Services-level -Restrict open ports using security groups So even if you knew which server was running OpsCenter, you couldn't just start poking around. Access to Cassandra is only via the t1.micro servers, which are in the same security group. The t1.micro servers only open up ssh and the port needed for external API requests.
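API-level request signing might look roughly like the sketch below. The slides only say requests are validated using a signature and a secure SHA1 hash; the use of HMAC-SHA1, the canonical string layout, and the names `sign_request` and `is_valid` are all assumptions made for illustration.

```python
import hashlib
import hmac

def sign_request(secret: bytes, method: str, path: str, params: dict) -> str:
    """Compute a hex HMAC-SHA1 signature over a canonical form of the request.

    The canonical form (method, path, sorted params) is an assumed layout.
    """
    canonical = "&".join(f"{key}={params[key]}" for key in sorted(params))
    message = f"{method.upper()}\n{path}\n{canonical}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha1).hexdigest()

def is_valid(secret: bytes, method: str, path: str, params: dict,
             signature: str) -> bool:
    """Recompute the signature server-side and compare in constant time."""
    expected = sign_request(secret, method, path, params)
    return hmac.compare_digest(expected, signature)

secret = b"shared-secret"            # hypothetical shared key
params = {"user": "uuid1", "ts": "213"}
signature = sign_request(secret, "GET", "/activities", params)
accepted = is_valid(secret, "GET", "/activities", params, signature)
tampered = is_valid(secret, "GET", "/activities", {"user": "uuid2", "ts": "213"}, signature)
```

Sorting the parameters before signing means client and server agree on the message regardless of parameter order, and the constant-time comparison avoids leaking how much of a forged signature matched.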
  • 36. #CASSANDRA13 Key Points... *You can effectively use Cassandra as: -A repository for social media data -The data source for workflows -A search index, via Solr integration
  • 37. #CASSANDRA13 And the Meta-Point *It is possible to do more with big data than optimize ad yields