Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/health-insurance-predictive-analysis-with-hadoop-and-machine-learning/julien-cabot
4. Benefits for the Insurance Company?
Understand the patients' subjects of interest to design customer-centric products and marketing actions
Anticipate the psycho-social effects of the Internet to prevent excessive consultations (and reimbursements)
Predict claims by monitoring requests about symptoms and drugs
6. The data problem
Understand the semantic field of Healthcare… as used on the Internet
Find correlations between the evolution of claims and… many millions of unidentified external variables
Find correlated variables… anticipating the claims
We need some help from Machine Learning!
7. Correlation search in external datasets
[Diagram: three data pipelines feeding a correlation engine]
Automated tokenization of messages per posted date and semantic tagging → trends of medical keywords used in forums
Google search volume of symptom and drug keywords → trends of medical keywords searched on Google
Socio-economical context from Open Data initiatives → trends of socio-economical factors
These trends, together with the health claims by act typology, feed the Correlation Search Machine, which outputs a matrix sorted by determination coefficient (R²)
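As a minimal sketch of that final step, the sorted R² ranking could be computed for the linear case with NumPy as below (the input names are hypothetical; the project uses SVR for the nonlinear case, see slide 9):

import numpy as np

# Hypothetical inputs: claims is an (n_months,) series of claim volumes,
# candidates an (n_variables, n_months) matrix of external variables
# (forum trends, Google trends, socio-economical factors).
def r2_ranking(claims, candidates):
    scores = []
    for i, series in enumerate(candidates):
        r = np.corrcoef(claims, series)[0, 1]  # Pearson correlation
        scores.append((i, r ** 2))             # determination coefficient
    # variables sorted by R², best candidates first
    return sorted(scores, key=lambda item: item[1], reverse=True)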
8. Understand the semantic field of Healthcare
Pipeline: message tokenization by date → word stemming, tagging and common-word filtering with NLTK → timelines of healthcare keywords
How to tag Healthcare words?
1 - Build a first list of keywords
2 - Enrich the list with highly searched keywords
3 - Learn automatically from Wikipedia Medical Categories
These three steps produce the healthcare semantic field keywords database
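A minimal sketch of the tagging step, assuming the keywords database is loaded as a set of stemmed healthcare terms (the set contents below are placeholders):

from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()

# Placeholder database: stemmed healthcare terms collected by steps 1-3
healthcare_stems = set(['grip', 'vaccin', 'antibiot'])

def tag_healthcare(tokens):
    # keep only tokens whose stem belongs to the healthcare semantic field
    return [t for t in tokens if stemmer.stem(t) in healthcare_stems]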
9. How to find correlations between time series?
Compare the evolution of the variable and of the claims over time
Find a nonlinear regression: learn a predictive function f(x) from the dataset with Support Vector Regression (SVR)
[Figure: the ε-insensitive tube around the regression function, showing f(x) + ε, f(x) and f(x) − ε against x]
Problem to solve:
    min over w:   (1/2) wᵀw
    subject to:   yᵢ − (wᵀ·ϕ(xᵢ) + b) ≤ ε
                  (wᵀ·ϕ(xᵢ) + b) − yᵢ ≤ ε
Resolution:
• Stochastic gradient descent
• Test the response through the coefficient of determination R²
Open source ML libraries help!
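For instance, scikit-learn's SVR exposes exactly this ε-insensitive formulation (it is solved via libsvm rather than stochastic gradient descent). A minimal sketch with placeholder data:

import numpy as np
from sklearn.svm import SVR

# Placeholder series: x is one external variable (e.g. a keyword trend),
# y the claims series, aligned over the same dates
x = np.arange(48, dtype=float).reshape(-1, 1)
y = np.sin(x).ravel()

svr = SVR(kernel='rbf', epsilon=0.1)  # epsilon defines the insensitive tube
svr.fit(x, y)

# coefficient of determination R² of the fitted response
print svr.score(x, y)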
10. Data Processing Profiles
The current volume of external data grabbed is large but not that huge (~10 Gb)
Data aggregation (e.g. SELECT … GROUP BY date)
Correlation search (e.g. SVR computing): data volume ~5 Gb × 12³ = 8.64 Tb
We need parallel computing to divide the RAM requirements and the processing time!
12. IT drivers
Requirements → IT drivers
Data aggregation: aggregate data from Mb to Gb files with sequential IO reads → IO Elasticity
Large tasks execution: SVR and NLP execution time is ~100 ms per task → CPU Elasticity
Large RAM execution: process many Tb of in-memory data → RAM Elasticity
Increase the ROI of the research project while decreasing the TCO → Commodity HW, Low CAPEX, OSS SW, Low OPEX, Cost Elasticity
13. Available solutions
[Matrix rating each solution – RDBMS, In-Memory analytics, HPC, Hadoop, AWS Elastic MapReduce – against the IT drivers: IO Elasticity, CPU Elasticity, RAM Elasticity, Cost Elasticity, OSS Software and Commodity Hardware; some ratings are qualified "with repartitioning" or "through tasks"]
15. Hadoop components
Custom App (Java, C#, PHP, …) · Data mining tools (R, SAS) · BI tools (Tableau, Pentaho, …)
Hue: Hadoop GUI · Pig: flow processing · Streaming: MR scripting · Hive: SQL-like querying
Oozie: MR workflow · MapReduce: parallel processing framework · Zookeeper: coordination service
Mahout: machine learning · Sqoop: RDBMS integration
Hama: bulk synchronous processing · Flume: data stream integration
Solr: full-text search · HBase: NoSQL on HDFS
HDFS: distributed file storage
Grid of commodity hardware: storage and processing
16. General architecture of the platform
[Architecture diagram]
DataViz Application
AWS S3: stores the raw data and the result files
Redis: stores detailed results for drill-down
Elastic MapReduce cluster:
    1 Master instance
    2 Core instances (2 x m2.4xlarge)
    4 Task instances (4 x m2.4xlarge), used for SVR and NLP processing only
17. Data aggregation with Pig Job flow
Num_of_messages_by_date.pig
-- Count forum messages per posted date.
-- Assumes the input file is ';'-separated, like the streaming jobs below.
records = LOAD '/input/forums/messages.txt' USING PigStorage(';')
    AS (str_date:chararray, message:chararray, url:chararray);
date_grouped = GROUP records BY str_date;
results = FOREACH date_grouped GENERATE group, COUNT(records);
DUMP results;
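The script can be checked on a data sample locally, without any cluster, before being submitted as an Elastic MapReduce job flow:

pig -x local Num_of_messages_by_date.pig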
18. Hadoop streaming
Hadoop streaming runs map/reduce jobs with any executables or scripts, communicating through standard input and standard output
Conceptually, it looks like this (distributed across the cluster):
cat input.txt | map.py | sort | reduce.py
Why Hadoop streaming?
Intensive use of NLTK for Natural Language Processing
Intensive use of NumPy and Sklearn for Machine Learning
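A typical job submission looks like the following (the streaming jar location depends on the Hadoop distribution, and the output path here is only illustrative):

hadoop jar hadoop-streaming.jar \
    -input /input/forums/messages.txt \
    -output /output/stems_by_date \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py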
19. Stemmed word distribution with Hadoop streaming, mapper.py
Stem_distribution_by_date/mapper.py
import sys
from nltk.tokenize import regexp_tokenize
from nltk.stem.snowball import FrenchStemmer

# create the stemmer once, outside the input loop
stemmer = FrenchStemmer()

# input comes from STDIN (standard input)
for line in sys.stdin:
    line = line.strip()
    str_date, message, url = line.split(";")
    # tokenize on word characters
    tokens = regexp_tokenize(message, pattern=r'\w+')
    for token in tokens:
        word = stemmer.stem(token)
        # skip very short stems (mostly common words and noise)
        if len(word) >= 3:
            # emit 'stem;date' pairs for the reducer
            print '%s;%s' % (word, str_date)
20. Stemmed word distribution with Hadoop streaming, reducer.py
Stem_distribution_by_date/reducer.py
import sys
import json
from itertools import groupby
from operator import itemgetter
from nltk.probability import FreqDist

def read(f):
    # yield (stem, date) pairs from the mapper's 'stem;date' lines
    for line in f:
        line = line.strip()
        yield line.split(';')

data = read(sys.stdin)
# Hadoop sorts the mapper output by key, so all lines
# for a given stem arrive contiguously
for current_stem, group in groupby(data, itemgetter(0)):
    dates = [item[1] for item in group]
    # frequency of each date = the timeline of the stem
    freq_dist = FreqDist(dates)
    print "%s;%s" % (current_stem, json.dumps(freq_dist))
22. Conclusions
The correlation search currently identifies 462 variables correlated with the claims, with R² ≥ 80% and a lag ≥ 1 month
Amazon Elastic MapReduce provides the elasticity required by the morphology of the jobs, as well as cost elasticity
Monthly cost with zero activity: < €5
Monthly cost with intensive activity: < €1,000
The equivalent cost of a dedicated platform would be around €50,000
The S3 transfer overhead is not a problem given the volume of stored data
During correlation search processing, at most 80% of the virtual CPUs are used, due to job scheduling with a parallelism factor of 36 instead of 48 with respect to SMP
23. Future works
Data mining
Increase the number of data sources
Test the robustness of the predictive model over time
Reduce the overfitting of the correlations
Enhance the correlation search from single words to word combinations
IT
Switch only the correlation search to a MapReduce engine for SMP architectures and clusters of cores, inspired by Stanford Phoenix and the Nokia Disco engine
Industrialize the data mining components as a platform, to generalize it to IARD (property and casualty) insurance, banking, e-commerce, telecoms and retail
24. OCTO in a nutshell
Big Data Analytics offer:
Business case and benchmark studies
Business proofs of concept
Data feeds: web trends
Big Data and Analytics architecture design
Big Data project delivery
Training and seminars: Big Data, Hadoop
IT consulting firm:
Established in 1998
175 employees
€19.5 million turnover worldwide (2011)
[Map: OCTO offices]
Verticals-based organization:
Banking – Financial Services
Insurance
Media – Internet – Leisure
Industry – Distribution
Telecom – Services