3. About BloomReach’s applications
Content understanding
Organic Search
• What it does: Content optimization, management and measurement
• Benefit: Enhanced discoverability and customer acquisition in organic search
SNAP
• What it does: Personalized onsite search and navigation across devices
• Benefit: Relevant and consistent onsite experiences for new and known users
Compass
• What it does: Merchandising tool that understands products and identifies opportunities
• Benefit: Prioritize and optimize online merchandising
7. Elastic MapReduce (EMR)
Usage
• We serve 150+ customer websites
• 100+ million pages processed per day
• More than 400 million users seen per day
• Multiple Hadoop steps (clusters)
Usage Metric              BloomReach Volume
Clusters per day          1500-2000
Hadoop jobs per day       5000-6000
Instance hours per day    25,000-30,000
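To put the volumes above in per-cluster terms, the following is a rough back-of-the-envelope sketch. It assumes the three daily ranges can vary independently, which the slide does not state, so the results are loose bounds rather than measured averages. The helper name `ratio_range` is illustrative only.

```python
def ratio_range(numerator, denominator):
    """Bound numerator/denominator when each is given as a (low, high) range."""
    (n_lo, n_hi), (d_lo, d_hi) = numerator, denominator
    # Smallest ratio: low numerator over high denominator, and vice versa.
    return n_lo / d_hi, n_hi / d_lo

# Daily volumes taken from the usage table.
CLUSTERS_PER_DAY = (1500, 2000)
JOBS_PER_DAY = (5000, 6000)
INSTANCE_HOURS_PER_DAY = (25000, 30000)

jobs_per_cluster = ratio_range(JOBS_PER_DAY, CLUSTERS_PER_DAY)
hours_per_cluster = ratio_range(INSTANCE_HOURS_PER_DAY, CLUSTERS_PER_DAY)

print("Hadoop jobs per cluster:", jobs_per_cluster)      # (2.5, 4.0)
print("Instance hours per cluster:", hours_per_cluster)  # (12.5, 20.0)
```

Under that assumption, a typical cluster runs a handful of Hadoop jobs and consumes on the order of 12 to 20 instance hours.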
10. Resource Selection
• Dynamic resource (instance type) selection based on CPU, memory
maxCpuPerUnitPrice = 0
optimalInstanceType = null
For each instance_type in (Availability Zone, Region)
{
    cpuPerUnitPrice = instance_type.cpuCores / instance_type.spotPrice
    if (maxCpuPerUnitPrice < cpuPerUnitPrice) {
        // remember the best ratio seen so far, otherwise the last type always wins
        maxCpuPerUnitPrice = cpuPerUnitPrice;
        optimalInstanceType = instance_type;
    }
}
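The selection loop above can be sketched in runnable Python. This is a minimal illustration, not BloomReach's implementation: the `SpotOffer` record, the function name `pick_optimal`, and the instance names and prices are all hypothetical, and a real version would fetch current spot prices for the chosen Availability Zone and Region.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SpotOffer:
    """One candidate instance type (hypothetical record, illustrative fields)."""
    instance_type: str
    cpu_cores: int
    spot_price: float  # price per instance-hour


def pick_optimal(offers: Iterable[SpotOffer]) -> Optional[SpotOffer]:
    """Return the offer with the most CPU cores per unit price, or None."""
    best = None
    best_ratio = 0.0
    for offer in offers:
        if offer.spot_price <= 0:
            continue  # skip unusable price data instead of dividing by zero
        ratio = offer.cpu_cores / offer.spot_price
        if ratio > best_ratio:
            best_ratio = ratio  # track the running maximum
            best = offer
    return best


# Made-up offers, for illustration only.
offers = [
    SpotOffer("type-a", 4, 0.10),   # 40 cores per unit price
    SpotOffer("type-b", 8, 0.16),   # 50 cores per unit price
    SpotOffer("type-c", 16, 0.40),  # 40 cores per unit price
]
best = pick_optimal(offers)
print(best.instance_type)  # type-b
```

The same loop extends to memory by swapping `cpu_cores` for a memory field, or by scoring on whichever resource the job is bound by.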
11. Workflow Management
• Makefile
• A framework for flow control using Python metaprogramming
[Figure: dependency graph over tasks A, B, C and D]
Valid Flows:
A->B->C->D
A->B->D->C
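The valid flows listed above are the orderings that respect task dependencies. The sketch below enumerates them in plain Python. The dependency edges (B after A, C and D each after B) are an assumption inferred from the two listed flows, since the arrows of the original diagram did not survive, and `valid_flows` is a hypothetical helper rather than part of the framework the slide describes.

```python
from typing import Dict, List, Set


def valid_flows(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Enumerate every order that runs each task after all its prerequisites.

    `deps` maps each task to the set of tasks it depends on; every
    prerequisite must itself appear as a key.
    """
    flows: List[List[str]] = []

    def extend(order: List[str], done: Set[str]) -> None:
        if len(order) == len(deps):
            flows.append(list(order))
            return
        for task in sorted(deps):
            # A task is runnable once all of its prerequisites are done.
            if task not in done and deps[task] <= done:
                order.append(task)
                done.add(task)
                extend(order, done)
                order.pop()
                done.remove(task)

    extend([], set())
    return flows


# Assumed edges: B needs A; C and D each need B.
DEPS = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"B"}}

for flow in valid_flows(DEPS):
    print("->".join(flow))
# A->B->C->D
# A->B->D->C
```

Enumerating all orders is only practical for small graphs. A scheduler normally needs just one valid order, or runs independent tasks such as C and D in parallel.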
12. EMR Best Practices
• Use spot instances for cost optimization
• Use EMR tags for cost tracking
• Share EMR clusters for small jobs
• Keep track of long-running clusters
• Use optimal resource type based on resource usage (e.g. CPU, memory)
• Use a workflow management framework to sequence dependent jobs
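Two of the practices above, cost tracking by tag and watching for long-running clusters, can be illustrated with a small sketch. It works on hypothetical in-memory records rather than the EMR API. The record fields, the `team` tag key, the 24-hour threshold and the fixed timestamp are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def long_running(clusters, now, max_age=timedelta(hours=24)):
    """Return IDs of clusters that have been up longer than max_age."""
    return [c["id"] for c in clusters if now - c["started"] > max_age]


def instance_hours_by_tag(clusters, tag_key):
    """Sum instance hours per value of tag_key; untagged clusters are grouped."""
    totals = defaultdict(float)
    for c in clusters:
        totals[c["tags"].get(tag_key, "untagged")] += c["instance_hours"]
    return dict(totals)


# Arbitrary fixed "current time" and made-up cluster records.
NOW = datetime(2000, 1, 2, 12, 0, tzinfo=timezone.utc)
CLUSTERS = [
    {"id": "cluster-1", "started": NOW - timedelta(hours=2),
     "instance_hours": 16.0, "tags": {"team": "search"}},
    {"id": "cluster-2", "started": NOW - timedelta(hours=30),
     "instance_hours": 240.0, "tags": {"team": "merchandising"}},
    {"id": "cluster-3", "started": NOW - timedelta(hours=5),
     "instance_hours": 40.0, "tags": {}},
]

print(long_running(CLUSTERS, NOW))              # ['cluster-2']
print(instance_hours_by_tag(CLUSTERS, "team"))
```

Grouping untagged clusters under their own label makes missing tags visible in the cost report instead of silently dropping them.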