How AdMobius uses Cascading in
AdTech Stack
Jyotirmoy Sundi
Sr. Data Engineer at Lotame
(AdMobius was acquired by Lotame in March 2014)
What does AdMobius do

AdMobius is a Mobile Audience Management
Platform (MAMP). It helps advertisers identify
mobile audiences by demographics and interests
through standard, custom, and private segments,
and reach them at scale.
Target effectively across all platforms on multiple devices
Laptop
Mobile
iPod
iPad
Wearables
Topics

Device graph building and scoring device links

Cascading Taps for Hive, MySQL, HBase

Modularized Testing

Optimal Config Setups

Running in YARN

Conclusion
AdMobius Stack
Cascading | Hive | HBase | Giraph
Hadoop | (Experimental Spark)
Rackspace
YARN | MR1
Custom Workflows

Why Cascading
− Easy custom aggregators.
• In the existing MR framework it was very difficult
to write a series of complex aggregations and
to verify their correctness before running them
at scale. You can do that in Hive with UDFs or
UDAFs, but we found it much easier in Cascading.
− Easy for Java developers to understand
• Visualize and write complicated workflows through
the concepts of pipes, taps, and tuples.
Workflow for audience profile scoring
Driven
https://driven.cascading.io/index.html#/apps/D818DD
Audience Profiling

Cascading is used to do
− complex aggregations
− create the device multi-dimensional vectors
− device pair scoring based on the vectors
− rule engine based filters

Size
− Total number of mobile devices ~ 2.7B
− ~500M devices in Giraph computation.
Example: Parallel aggregation of values across multiple fields.
Aggregations

No need to know group modes like in UDAF

Buffer

use for more complex grouping
operations

output multiple tuples per group

Aggregator (simple aggregations, prebuilt
aggregators like SumBy, CountBy)
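import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Buffer;
import cascading.operation.BufferCall;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;
import java.nio.ByteBuffer;
import java.util.Iterator;

// Graph, Node and scoring(...) are AdMobius classes that are not shown on the slide.
public class MinGraphScoring extends BaseOperation implements Buffer {
  @Override
  public void operate(FlowProcess flowProcess, BufferCall bufferCall) {
    Iterator<TupleEntry> arguments = bufferCall.getArgumentsIterator();
    Graph g = new Graph();
    // Load every tuple of the current group into the graph.
    while (arguments.hasNext()) {
      TupleEntry tpe = arguments.next();
      // use Kryo serialization
      ByteBuffer b = ByteBuffer.wrap((byte[]) tpe.getObject("field1"));
      g.put(b);
    }
    Node[] nodes = g.nodes;
    // For each pair of nodes i, j (unordered pairs assumed), emit one scored tuple.
    for (int i = 0; i < nodes.length; i++) {
      for (int j = i + 1; j < nodes.length; j++) {
        double minmaxscore = scoring(g, i, j);
        Tuple t1 = new Tuple(nodes[i].id, nodes[j].id, minmaxscore);
        bufferCall.getOutputCollector().add(t1);
      }
    }
  }
}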
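The Buffer pattern above (see every tuple of one group, then emit one tuple per device pair) can be sketched in plain Java without Cascading on the classpath. This is only an illustration: the class name `PairScoringSketch`, the per-device `weights`, and the min/max ratio used as the score are assumptions, since the slide shows neither `Graph` nor `scoring`.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for a Buffer: it receives all rows of one group and may
// emit any number of output rows (here, one per unordered device pair).
final class PairScoringSketch {

    // Hypothetical score: smaller weight divided by larger weight.
    // For non-negative weights the result lies in [0, 1].
    static double score(double a, double b) {
        double max = Math.max(a, b);
        return max == 0.0 ? 0.0 : Math.min(a, b) / max;
    }

    // Emits "idI,idJ,score" for every pair with i < j in the group.
    static List<String> scoreGroup(String[] ids, double[] weights) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < ids.length; i++) {
            for (int j = i + 1; j < ids.length; j++) {
                out.add(ids[i] + "," + ids[j] + "," + score(weights[i], weights[j]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = scoreGroup(new String[] {"d1", "d2", "d3"},
                                       new double[] {2.0, 4.0, 4.0});
        rows.forEach(System.out::println);
    }
}
```

A group of n devices produces n(n-1)/2 output rows, which is why this kind of logic fits a Buffer (many outputs per group) rather than an Aggregator (one result per group).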
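import cascading.flow.FlowProcess;
import cascading.operation.Aggregator;
import cascading.operation.AggregatorCall;
import cascading.operation.BaseOperation;
import cascading.tuple.TupleEntry;

public class PotentialMatchAggregator extends
    BaseOperation<PotentialMatchAggregator.IDList> implements
    Aggregator<PotentialMatchAggregator.IDList> {
  // Called once at the start of each group: create the per-group context.
  @Override
  public void start(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
    IDList idList = new IDList();
    aggregatorCall.setContext(idList);
  }
  // Called once per tuple in the group: update the context.
  @Override
  public void aggregate(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
    TupleEntry arguments = aggregatorCall.getArguments();
    IDList idList = aggregatorCall.getContext();
    // amid and match come from arguments; the extraction is not shown on the slide.
    idList.updateDev(amid, match);
  }
  // Called once at the end of each group: emit the result.
  @Override
  public void complete(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
    IDList idList = aggregatorCall.getContext();
    …...
  }
}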
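The start / aggregate / complete lifecycle used above can be imitated in plain Java to show how the per-group context travels between calls. This is a rough sketch, not the Cascading API: `AggregatorLifecycleSketch`, the simplified `IdList`, and the `|`-joined output are hypothetical, and the contexts are kept in a map here, whereas a real group-by delivers one group at a time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of the Aggregator lifecycle: a context is created when
// a group starts, updated once per row, and turned into one result at the end.
final class AggregatorLifecycleSketch {

    // Hypothetical per-group context, loosely modelled on the slide's IDList.
    static final class IdList {
        final List<String> ids = new ArrayList<>();
    }

    static IdList start() {
        return new IdList();
    }

    // Adds the id to the context unless it was already seen in this group.
    static void aggregate(IdList ctx, String id) {
        if (!ctx.ids.contains(id)) {
            ctx.ids.add(id);
        }
    }

    static String complete(IdList ctx) {
        return String.join("|", ctx.ids);
    }

    // Drives the lifecycle over rows of (groupKey, id).
    static Map<String, String> run(String[][] rows) {
        Map<String, IdList> contexts = new LinkedHashMap<>();
        for (String[] row : rows) {
            IdList ctx = contexts.computeIfAbsent(row[0], k -> start());
            aggregate(ctx, row[1]);
        }
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, IdList> e : contexts.entrySet()) {
            results.put(e.getKey(), complete(e.getValue()));
        }
        return results;
    }

    public static void main(String[] args) {
        String[][] rows = {{"a", "x"}, {"b", "y"}, {"a", "z"}, {"a", "x"}};
        System.out.println(run(rows));
    }
}
```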
Joins

CoGroup:

when the two pipes cannot both fit into memory

HashJoin

when one of the pipes fits into memory
Pipe jointermsPipe = new HashJoin(
    termsPipe, new Fields("term_token"),
    dictionary, new Fields("word"),
    new Fields("app", "term_token", "score", "d_count", "index", "word"),
    new InnerJoin());

CustomJoins and BloomJoin
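Why a HashJoin needs one side to fit into memory can be seen in a plain-Java sketch of the idea: the small side is loaded into a hash map and the large side is streamed past it once. The names below (`HashJoinSketch`, a dictionary mapping `word` to an integer index) are assumptions for illustration, not the AdMobius schema.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HashJoinSketch {

    // Inner join of term tokens against a dictionary (word -> index).
    // The dictionary is the small side and is held entirely in memory;
    // the term tokens are the large side and are only iterated once.
    static List<String> innerJoin(List<String> termTokens, Map<String, Integer> dictionary) {
        Map<String, Integer> lookup = new HashMap<>(dictionary);
        List<String> joined = new ArrayList<>();
        for (String token : termTokens) {
            Integer index = lookup.get(token);
            if (index != null) {          // inner join: drop tokens with no match
                joined.add(token + "\t" + index);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        dictionary.put("ads", 0);
        dictionary.put("mobile", 1);
        List<String> terms = new ArrayList<>();
        terms.add("ads");
        terms.add("xyz");
        terms.add("mobile");
        innerJoin(terms, dictionary).forEach(System.out::println);
    }
}
```

When neither side fits into memory this approach breaks down, and a sort-and-group join such as CoGroup is the usual alternative.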
Custom Src/Sink Taps

Cascading has good support for reading from and writing to different kinds of
data sources. Slight tuning or changes might be required, but most of the
code already exists.
− Hive (with different file formats), HBase, MySQL
− http://www.cascading.org/extensions/
− Set proper config parameters when reading from a source tap,
for example when reading from an HBase tap:
String tableName = "device_ids";
String[] familyNames = new String[] { "id:type1", "id:type2",
    "id:type3", /* ... */ "id:typen" };
Scan scan = new Scan();
scan.setCacheBlocks(false);  // do not fill the block cache during a full scan
scan.setCaching(10000);      // rows fetched per RPC
scan.setBatch(10000);        // maximum cells returned per Result
Hive Src Taps
ExampleWorkflow.java
Tap dmTap = new HiveTableTap(HiveTableTap.SchemeType.SEQUENCE_FILE, admoFPbase, admoFPBasePartitions, dmFullFilter);
HiveTableTap.java
public class HiveTableTap extends GlobHfs {
  static Scheme getScheme(SchemeType st) {
    if (st.equals(SchemeType.SEQUENCE_FILE))
      return new AdmobiusWritableSequenceFile(new Fields("value"), BytesWritable.class);
    else if (st.equals(SchemeType.TEXT_TSV))
      return new TextDelimited();
    else
      return null;
  }
  …..
}
Hive Sink Taps
ExampleWorkflow.java
Tap srcDstIdsSinkTap = new Hfs(
    new AdmobiusWritableSequenceFile(new Fields("value"), (Class<? extends Writable>) Text.class),
    "/tmp/srcDstIdsSinkTap", SinkMode.REPLACE);
HiveTableTap.java
public class HiveTableTap extends GlobHfs {
  static Scheme getScheme(SchemeType st) {
    if (st.equals(SchemeType.SEQUENCE_FILE))
      return new AdmobiusWritableSequenceFile(new Fields("value"), BytesWritable.class);
    else if (st.equals(SchemeType.TEXT_TSV))
      return new TextDelimited();
    else
      return null;
  }
  …..
}
conf.setOutputFormat( SequenceFileOutputFormat.class );
valueValue = (Writable) (new Text(tupleEntry.getObject( 0 ).toString().getBytes()));
Hive table
CREATE TABLE CASCADING_HIVE_INTER
(
admo_id string,
segments string
)
PARTITIONED BY ( batch_id STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
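Assuming the field delimiter of this table is the tab character, each value written by the sink has to be one line of the form `admo_id<TAB>segments` for Hive to split it into the two columns. A small plain-Java sketch of producing and checking such a value follows; `HiveRowSketch` and the comma used to pack several segments into the single `segments` column are assumptions, not something the slides specify.

```java
import java.util.Arrays;
import java.util.List;

final class HiveRowSketch {

    // Must match FIELDS TERMINATED BY in the table definition (assumed: tab).
    static final String FIELD_DELIMITER = "\t";

    // Builds one row: admo_id, then all segments packed into one string column.
    static String toRow(String admoId, List<String> segments) {
        return admoId + FIELD_DELIMITER + String.join(",", segments);
    }

    // Splits a row back into its columns; -1 keeps trailing empty columns.
    static String[] parseRow(String line) {
        return line.split(FIELD_DELIMITER, -1);
    }

    public static void main(String[] args) {
        String row = toRow("id1", Arrays.asList("s1", "s2"));
        System.out.println(row);
        System.out.println(parseRow(row).length);
    }
}
```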
Good Practices

Use Checkpointing optimally

Use subassemblies instead of rewriting logic.
For further control pass additional parameters
to subassemblies.

Use compression and SequenceFile() in sink
taps to chain multiple Cascading workflows.

Use Failure Traps to filter faulty records.

Avoid creating workflows that are too small or too long.
Chain them in Oozie or a similar workflow
management engine.
− Example: workflows with 10-20 MR jobs are good
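The idea behind a failure trap, diverting records that fail an operation to a side output instead of failing the whole job, can be sketched in plain Java as below. This is a conceptual illustration only and does not use Cascading's trap API; `FailureTrapSketch` and the choice of number parsing as the failing operation are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FailureTrapSketch {

    final List<Double> good = new ArrayList<>();     // successfully processed records
    final List<String> trapped = new ArrayList<>();  // faulty records, kept for inspection

    // Parses each (non-null) record as a score. A record that cannot be parsed
    // is routed to the trap and processing continues with the next record.
    void process(List<String> records) {
        for (String record : records) {
            try {
                good.add(Double.parseDouble(record.trim()));
            } catch (NumberFormatException e) {
                trapped.add(record);
            }
        }
    }

    public static void main(String[] args) {
        FailureTrapSketch sketch = new FailureTrapSketch();
        sketch.process(Arrays.asList("0.7", "oops", "1.5"));
        System.out.println("good=" + sketch.good + " trapped=" + sketch.trapped);
    }
}
```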
Some Properties for Optimal Performance
Problems with improper configuration
1. Compression parameters: If they are not set, jobs run
slowly and may sometimes take double the time. Set the correct
compression type based on cluster configs.
2. mapred.reduce.tasks: It needs to be set manually
depending on the size of your job. Keeping it too low
slows down the reduce phase.
3. Small file issue: The input splits read by mappers
would be too small, eventually bringing up more mappers
than required.
4. Any custom configuration parameters: Set them
here and use getProperty to access them anywhere in the
data workflow.
properties.setProperty("min_cutoff_score", "0.7");
FlowConnector flowConnector = new HadoopFlowConnector(properties);
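On the reading side, a custom parameter set this way comes back as a string and has to be parsed before use. A minimal sketch with `java.util.Properties` only: the class `CutoffPropertySketch`, the fallback value `"0.5"`, and the inclusive comparison are assumptions, while the `min_cutoff_score` key is taken from the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

final class CutoffPropertySketch {

    // Reads the cutoff as a double; "0.5" is an assumed default for the sketch.
    static double minCutoff(Properties properties) {
        return Double.parseDouble(properties.getProperty("min_cutoff_score", "0.5"));
    }

    // Keeps scores at or above the configured cutoff (inclusive by assumption).
    static List<Double> keepAboveCutoff(List<Double> scores, Properties properties) {
        double cutoff = minCutoff(properties);
        List<Double> kept = new ArrayList<>();
        for (double s : scores) {
            if (s >= cutoff) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty("min_cutoff_score", "0.7");
        System.out.println(keepAboveCutoff(Arrays.asList(0.2, 0.7, 0.9), properties));
    }
}
```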
Running in YARN

YARN deployment is smooth with Cascading 2.5
− Make sure the config properties are set as per
YARN, as they are different from MR1.
− While running in workflow engines like Oozie,
make sure that
• mapred.job.classpath.files and mapred.cache.files
are set with all dependency files in colon-separated
format
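Building the colon-separated value by hand is error-prone, so it may help to generate it from a list. The sketch below only covers `mapred.job.classpath.files` and follows the colon-separated format stated on the slide; the class name and the jar paths are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

final class ClasspathPropertySketch {

    // Joins dependency paths with ':' as described on the slide.
    static String colonSeparated(List<String> files) {
        return String.join(":", files);
    }

    // Returns properties with the classpath entry filled in.
    static Properties withDependencies(List<String> files) {
        Properties properties = new Properties();
        properties.setProperty("mapred.job.classpath.files", colonSeparated(files));
        return properties;
    }

    public static void main(String[] args) {
        // Hypothetical HDFS paths, for illustration only.
        List<String> jars = Arrays.asList("/libs/cascading-core.jar", "/libs/app.jar");
        System.out.println(withDependencies(jars).getProperty("mapred.job.classpath.files"));
    }
}
```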
Cascading DSLs in other languages
Scalding (Scala)
PyCascading (Python)
cascading.jruby (JRuby)
Cascalog (Clojure)

Thank you for your time

Q & A

Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/)
