Lessons learned while building a solution to crunch 100 billion+ positions for better navigation algorithms. This talk highlights how you can employ big data technology on commodity hardware without spending a fortune on it.
More details on: http://2013.howtoweb.co/
2. What do I know about big data?
- skobbler logs all positions from our users (100 billion+)
- More than 10 TB of data from users
- Products / revenues significantly improved with business intelligence
Big data on a small budget
@apphil #2
3. Why should you learn about big data?
- Harvard Business Review: “Data Scientist: The Sexiest Job of the 21st Century”
- Obama became president of the US in large part due to the use of big data…
- World-class sports teams enhance their performance with big data
- Amazon, Google, Facebook, etc. have by now made all their dev processes data-driven
4. What are some great use cases for big data?
- Analyzing log files and user behavior (and predicting future behavior)
- A/B testing and automatic optimization of functionality
- Improving monetization (e.g. ad optimization, etc.)
- Checking adoption and usage of new features
5. When is it better not to rely on big data?
- When qualitative feedback is better than quantitative feedback (e.g. very early-stage companies)
- When you don’t have enough users yet to get statistically relevant results
- When you do not know what you are optimizing for
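To make "enough users" concrete, here is a rough sketch of the textbook sample-size approximation for comparing two conversion rates. The 10% and 12% rates, the 5% significance level and the 80% power are assumptions picked for illustration, not figures from the talk.

```python
import math
from statistics import NormalDist


def users_per_variant(p_base, p_new, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect a change from
    p_base to p_new with a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power threshold
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_alpha + z_beta) ** 2 * variance / (p_base - p_new) ** 2
    return math.ceil(n)


# Hypothetical numbers: 10% conversion today, hoping for 12%.
print(users_per_variant(0.10, 0.12))
```

With these assumed inputs the answer is a few thousand users per variant, and halving the expected uplift roughly quadruples it. Below that scale, qualitative feedback is usually the better guide.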
6. What does a solid and simple workflow for big data analysis look like?
Log -> Process -> Analyse -> Improve -> Eval / Test (then back to Log)
7. Tools / technologies for a good big data setup
- Logging: MongoDB, VoltDB, Cassandra
- Processing & analyzing / storing: Hadoop & HBase (batch), Storm (real-time), Samza (real-time)
- Optimizing: Mahout (machine learning)
8. How can you build this without breaking the bank?
- Analyse / process asynchronously
- Cheap dedicated servers (vs. cloud)
- Use open / free software
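One way to read "analyse / process async" is that the code path serving the user only enqueues a record, and a background worker does the expensive part later. The sketch below uses Python's standard `queue` and `threading` modules; the record fields and the km/h conversion are made-up stand-ins for real processing.

```python
import queue
import threading

log_queue = queue.Queue()
processed = []


def worker():
    """Drain the queue in the background so the hot path never waits."""
    while True:
        record = log_queue.get()
        if record is None:  # sentinel value: stop the worker
            break
        # Stand-in for real processing: derive km/h from m/s.
        processed.append({**record, "speed_kmh": record["speed_ms"] * 3.6})


def log_position(record):
    """Called while serving the user: enqueue only, no processing."""
    log_queue.put(record)


thread = threading.Thread(target=worker, daemon=True)
thread.start()
log_position({"user": "u1", "speed_ms": 10.0})
log_position({"user": "u2", "speed_ms": 20.0})
log_queue.put(None)  # ask the worker to finish
thread.join()
print(processed)
```

In a real setup the in-process queue would typically be replaced by a log store or message queue, so that processing can run on separate, cheaper machines.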
9. Key cost factor: Real-time, near-time vs. batch
- Real-time is much more expensive than batch
- Leverage as much preprocessing as possible
- Try using in-memory technology for real-time analytics
10. #1 Log: Initially as much data as feasible should be logged so it’s available later
- Define interesting data (rather log too much if unsure)
- Upload / collect data
- Decide on real-time, near-time or batch processing in the chain
11. #2 Process: Enhance the data and make it as rich as possible and easy to query
- Move data to processing environment
- Run logged data through processing chain so it can be queried
- Enhance the logged data with any additional data available (e.g. geography, social data, user data, etc.)
12. #3 Analyse: Cluster the data in meaningful groups and compare it
- Define key performance indicators (KPIs)
- Cluster data in a meaningful way (e.g. by geography, time of day, customer past behaviour)
- Compare data vs. reference sets
13. #4 Improve: Learn from analysis where your challenges are to optimize behavior
- Manually / automatically adjust features (e.g. lower prices in certain regions, etc.)
- Develop A/B testing scenarios and formulate improvement theories
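An A/B testing scenario needs a stable way to decide which users see which variant. A common approach, sketched here as an assumption and not as skobbler's method, is to hash the user id together with an experiment name; the function and experiment names are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("A", "B")):
    """Deterministically map a user to a variant: the same user always
    lands in the same bucket for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# Hypothetical experiment name and user id.
print(assign_variant("user-42", "regional-pricing"))
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user who is in group A of one test is not systematically in group A of the next.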
14. #5 Evaluate
- Check if the KPIs improve after applying the changes
- Accept changes that improved your users' behavior / reject changes that kept it the same
- Define which additional logs you might need to better cluster / identify behaviour
- Go back to step #1
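The accept / reject decision above can be sketched with a two-proportion z-test on a rate-style KPI such as "destination reached". The test choice, the 5% threshold and the counts are assumptions for illustration; the talk does not say which statistic was used.

```python
import math
from statistics import NormalDist


def z_test_two_proportions(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the rate difference B minus A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # both groups all-success or all-failure: no evidence
        return 0.0, 1.0
    z = (p_b - p_a) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


def decide(success_a, n_a, success_b, n_b, alpha=0.05):
    """Accept the change only if B is significantly better than A."""
    z, p_value = z_test_two_proportions(success_a, n_a, success_b, n_b)
    return "accept" if z > 0 and p_value < alpha else "reject"


# Hypothetical counts: reference (A) vs. changed version (B).
print(decide(900, 1000, 950, 1000))
print(decide(900, 1000, 905, 1000))
```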
15. #1 Log: Practical example of how this works at skobbler
Software version
Routing profile used
Device
Raw Positions
Geography (e.g. country)
Rating of the route (optional)
Destination reached (yes / no)
Etc.
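As an illustration, the fields listed above could be captured in one record per drive and written as a JSON line. The field names, the types and the position tuple layout are assumptions; the actual skobbler log format is not given in the talk.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DriveLog:
    software_version: str
    routing_profile: str
    device: str
    country: str
    destination_reached: bool
    route_rating: Optional[int] = None  # optional, as on the slide
    # Raw positions as (unix_time, latitude, longitude); assumed layout.
    positions: List[Tuple[float, float, float]] = field(default_factory=list)


def to_json_line(log: DriveLog) -> str:
    """Serialize one drive as a compact single-line JSON record."""
    return json.dumps(asdict(log), separators=(",", ":"))


# Hypothetical example record.
example = DriveLog(
    software_version="1.0",
    routing_profile="fastest",
    device="phone-x",
    country="DE",
    destination_reached=True,
    positions=[(0.0, 52.5, 13.4)],
)
print(to_json_line(example))
```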
16. #2 Process: Enhance and split the data based on drives and segments
- Combine the data on a per-drive basis (= session)
- Combine the data on a per-segment basis (= how fast people are driving on a street versus our estimate)
- Identify key behavior across the route (e.g. reroutings, etc.)
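The per-drive and per-segment grouping can be sketched in a few lines of plain Python. The sample tuples and their layout are hypothetical; at the data volumes described in this talk the same grouping would run as a batch job, not in a single process.

```python
from collections import defaultdict

# Hypothetical samples: (drive_id, segment_id, observed_kmh, estimated_kmh)
samples = [
    ("d1", "s1", 48.0, 50.0),
    ("d1", "s2", 95.0, 80.0),
    ("d2", "s2", 105.0, 80.0),
]


def group_by(rows, key_index):
    """Group sample tuples by the value at position key_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


def segment_speed_ratio(rows):
    """Mean observed speed divided by the routing estimate, per segment.
    A ratio above 1 means people drive faster than estimated."""
    ratios = {}
    for segment_id, segment_rows in group_by(rows, 1).items():
        observed = sum(r[2] for r in segment_rows) / len(segment_rows)
        estimated = segment_rows[0][3]
        ratios[segment_id] = observed / estimated
    return ratios


drives = group_by(samples, 0)          # one entry per drive (= session)
ratios = segment_speed_ratio(samples)  # one entry per street segment
print(ratios)
```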
17. Example: Real-time analysis with the Twitter Storm framework to detect road changes
Example visualization of drives in the last five minutes (real-time)
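The "last five minutes" view boils down to a sliding time window held in memory. The sketch below shows that idea in plain Python and deliberately does not use the Storm API; the class name and event layout are hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import deque


class RecentDrives:
    """Keeps only positions from the last window_s seconds in memory."""

    def __init__(self, window_s=300):
        self.window_s = window_s
        self._events = deque()  # (timestamp, drive_id, lat, lon), oldest first

    def add(self, timestamp, drive_id, lat, lon):
        self._events.append((timestamp, drive_id, lat, lon))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop everything that is window_s seconds old or older.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()

    def active_drives(self):
        return {event[1] for event in self._events}


window = RecentDrives()
window.add(0, "d1", 52.50, 13.40)
window.add(100, "d2", 52.51, 13.41)
window.add(350, "d3", 52.52, 13.42)
print(window.active_drives())
```

Because old events are evicted on every insert, memory use is bounded by the traffic inside the window, which is what makes an in-memory approach affordable for real-time views.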
18. Example: Historic driving patterns (processed with Hadoop / HBase)
19. #3 Analyse: Try to see in which areas our routing is not optimal
KPIs are:
- Route rating (if given)
- # of re-routings (the smaller the better)
- Time to destination vs. estimation by routing
Cluster the data by:
- Routing algorithm (and parameters used)
- Geography
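A minimal sketch of computing these three KPIs per cluster of (routing profile, country) follows. The record layout and the sample values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-drive records; rating is None when the user gave none.
drives = [
    {"profile": "fastest", "country": "DE", "rating": 4, "reroutings": 1,
     "actual_min": 33.0, "estimated_min": 30.0},
    {"profile": "fastest", "country": "DE", "rating": None, "reroutings": 3,
     "actual_min": 36.0, "estimated_min": 30.0},
    {"profile": "fastest", "country": "RO", "rating": 5, "reroutings": 0,
     "actual_min": 20.0, "estimated_min": 20.0},
]


def kpis_by_cluster(records):
    """Aggregate the KPIs for each (routing profile, country) cluster."""
    clusters = defaultdict(list)
    for record in records:
        clusters[(record["profile"], record["country"])].append(record)
    result = {}
    for key, rows in clusters.items():
        ratings = [r["rating"] for r in rows if r["rating"] is not None]
        result[key] = {
            "drives": len(rows),
            "avg_rating": mean(ratings) if ratings else None,
            "avg_reroutings": mean(r["reroutings"] for r in rows),
            # Above 1.0: drives took longer than the routing estimated.
            "eta_ratio": mean(r["actual_min"] / r["estimated_min"] for r in rows),
        }
    return result


for cluster, kpis in kpis_by_cluster(drives).items():
    print(cluster, kpis)
```

Clusters whose KPIs deviate most from a reference set are the areas where routing is most likely not optimal.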
20. #4 Improve: Come up with strategies to improve routing experience based on data
- For future routes, improve the estimate of time taken on a segment vs. time actually travelled
- Alter routing parameters based on country specifics to get better results (e.g. in Germany people drive faster on the Autobahn)
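Both strategies can be illustrated with a small sketch. The exponential moving average is one simple way to pull an estimate towards observed travel times and is an assumption here, not necessarily what skobbler uses; the weight, the default speed and the per-country factor are invented values.

```python
def update_estimate(current_estimate_s, observed_s, weight=0.1):
    """Exponential moving average: nudge the segment time estimate
    towards the time drivers actually needed."""
    return (1 - weight) * current_estimate_s + weight * observed_s


# Hypothetical per-country multiplier on the default motorway speed.
MOTORWAY_SPEED_FACTOR = {"DE": 1.15}


def motorway_speed_kmh(country, default_kmh=110.0):
    """Country-specific routing parameter with a neutral fallback."""
    return default_kmh * MOTORWAY_SPEED_FACTOR.get(country, 1.0)


# A segment estimated at 60 s that drivers actually took 80 s to pass.
print(update_estimate(60.0, 80.0))
print(motorway_speed_kmh("DE"), motorway_speed_kmh("RO"))
```

A small weight keeps single outlier drives from swinging the estimate, at the cost of reacting more slowly to real changes on the road.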
21. #5 Evaluate: Deploy the changes and compare them to reference data
- Deploy changes to production and compare ratings / timings vs. base values (~weekly)
- Verify whether other parameters such as usage, etc. also improve
22. Summary: Big data can drive big value but stay affordable
Simple formula:
Log -> Process -> Analyze -> Improve -> Evaluate = Success
23. Thank you for your attention!
Get in Touch: philipp.kandal@skobbler.com
Phone: +49-172-4597015
Follow me on
.com/apphil