Slides from a talk at the International PHP Conference 2012 on how we mastered the challenge of logging everything and transporting the logged data into different sinks for different needs.
2. Who we are.
§ Dr. Stefan Schadwinkel: Analytics. Author (heise.de, Cereb.Cortex, EJN, J.Neurophysiol.)
§ Mike Lohmann: Architecture. Author (PHPMagazin, IX, heise.de)
Log everything 2
3. Agenda.
§ What we do. What we need to do. What we are doing.
§ Requirement: Log everything!
§ Infrastructure and technologies.
§ We want happy business users.
5. Numberfacts of PokerStrategy.com
§ PokerStrategy.com: Education since 2005
§ 7.600.000 Requests/Day
§ 6.000.000 Registered Users
§ 19 Languages
§ 2.800.000 PI/Day
§ 700.000 Posts/Day
6. Topics of this talk
- How to use existing technologies and standards. - Out of the box solution
- Scalability and simplicity of the solution - Ready to use scripts
- „Good enough“ for now!
- Showing the way from requirement to solution.
- Open-source Sf2 bundles for logging.
- Live demo.
7. What we do.
§ We teach Poker.
§ We create web applications.
§ We serve millions of users in different countries, respecting a multitude of market rules.
§ We make business decisions driven by complex data analytics.
8. What we need to do.
§ We need to try out other teaching topics, fast.
§ We need to gather data from all of these „try-outs“, aggregate it, and base business decisions on its analysis.
§ We need a bigger infrastructure to gather more data.
§ We need to hire more (good) people! :)
9. What we are doing.
§ We build ECF (Education Community Framework).
§ We (can) log everything!
§ We (now) use Amazon S3 and Amazon EMR to have a scalable storage and MapReduce solution.
§ We hire (good) people! :)
10. Requirement: Log everything.
§ „Are you mad?!“
§ „Be more specific, please!“
§ „But what about the user's data?!“
11. Logging Tools / Technologies
§ Producer: Symfony2 Application Server, Databases
§ Transport: Now: RabbitMQ, Erlang Consumer. Was: Flume
§ Storage: Now: S3 Storage and Amazon EMR. Was: Virtualized Inhouse Hadoop
§ Analytics: MapReduce, Hive, Hadoop; BI via QlikView
15.10.12 11
14. Producer
§ LoggingComponent: Provides interfaces, filters and handlers
§ LoggingBundle: Glues everything together with Symfony2
https://github.com/ICANS/IcansLoggingComponent
https://github.com/ICANS/IcansLoggingBundle
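To make the filter-and-handler idea concrete, here is a rough sketch in Python using only the standard `logging` module. It is not the API of the ICANS component or bundle: the class names `DropPasswordFilter` and `JsonQueueHandler`, the message fields, and the use of "icans.content" as a channel name are all assumptions made for illustration, and a plain deque stands in for the message broker.

```python
import json
import logging
from collections import deque


class DropPasswordFilter(logging.Filter):
    """Hypothetical filter: strips a sensitive field before the record is handled."""

    def filter(self, record):
        context = getattr(record, "context", {})
        record.context = {k: v for k, v in context.items() if k != "password"}
        return True  # keep the record; only its context was cleaned


class JsonQueueHandler(logging.Handler):
    """Hypothetical handler: serializes records to JSON and appends them to a queue."""

    def __init__(self, queue):
        super().__init__()
        self.queue = queue  # stand-in for a broker connection

    def emit(self, record):
        self.queue.append(json.dumps({
            "channel": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }))


outbox = deque()
logger = logging.getLogger("icans.content")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example output out of the root logger

handler = JsonQueueHandler(outbox)
handler.addFilter(DropPasswordFilter())
logger.addHandler(handler)

logger.info("page viewed", extra={"context": {"user": 42, "password": "secret"}})
print(outbox[0])
```

The filter runs before the handler emits, so the sensitive field never reaches the transport.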
15. Transport – First Try
§ Hey, if we use Hadoop, why not use Flume?
- Part of the Ecosystem
- Central config
- Extensible via Plugins
- Flexible Flow Configuration
- How? Flume Nodes → Flume Sinks
16. Transport – First Try
§ But, .. wait!
- Ecosystem? Just like Hadoop version numbers…
- Admins say: Central config woes!
- Issues: multi-master, logical vs. physical nodes, Java heap space, etc.
- Will my plugin run with flume-ng?
- Ever tried to keep your complex flow and switch reliability levels?
Read: Our admins still hate me …
17. Transport – Second Try
§ RabbitMQ vs. Flume Nodes
- Each app server has its own local RabbitMQ
- The local RabbitMQ shovels its data to a central RabbitMQ cluster
- Similar to the Flume Node concept
- Decentralized config: Producers and consumers simply connect
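The local-to-central topology can be modelled in a few lines. This is an in-memory sketch only, not RabbitMQ or its shovel plugin: `LocalBroker`, `shovel` and the sample messages are hypothetical, and the buffering behaviour during an outage is an assumption about why such a setup is attractive, not something stated on the slide.

```python
from collections import deque


class LocalBroker:
    """Stands in for the RabbitMQ instance on one app server."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()

    def publish(self, message):
        # The producer only ever talks to its local broker.
        self.queue.append(message)


def shovel(local, central, central_up=True):
    """Move buffered messages to the central queue; return how many were moved.

    If the central cluster is unreachable, messages stay buffered locally.
    """
    if not central_up:
        return 0
    moved = 0
    while local.queue:
        central.append((local.name, local.queue.popleft()))
        moved += 1
    return moved


central_queue = deque()
app1, app2 = LocalBroker("app1"), LocalBroker("app2")
app1.publish("login")
app2.publish("page_view")

shovel(app1, central_queue, central_up=False)  # simulated outage: message stays local
app1.publish("logout")
shovel(app1, central_queue)
shovel(app2, central_queue)
print(list(central_queue))
```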
18. Transport – Second Try
§ But, .. wait! We still need Sinks.
- Custom crafted RabbitMQ consumers
- We could write them in PHP, but ..
- Erlang, teh awesome!
- Battle-hardened OTP framework.
- „Let it crash!“ .. and recover.
- Hot code change. If you want.
Read: Runs forever.
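The „Let it crash!“ idea, roughly: the worker does no defensive error handling, and a supervisor restarts it when it dies. The sketch below imitates that in Python; it is not Erlang/OTP. The names `supervise` and `store_chunk` are hypothetical, and dropping the message that caused the crash is a simplification chosen for the example.

```python
def supervise(worker, messages, max_restarts=3):
    """Run worker over messages; after a crash, restart and continue.

    Returns the processed results and the number of restarts. Gives up
    (re-raises) once max_restarts is exceeded, like a restart limit.
    """
    processed, restarts = [], 0
    pending = list(messages)
    while pending:
        try:
            while pending:
                processed.append(worker(pending[0]))
                pending.pop(0)
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise
            pending.pop(0)  # simplification: skip the message that caused the crash
    return processed, restarts


def store_chunk(message):
    """Hypothetical worker: crashes on malformed input instead of guarding against it."""
    if not isinstance(message, str):
        raise ValueError("malformed message")
    return message.upper()


processed, restarts = supervise(store_chunk, ["a", None, "b"])
print(processed, restarts)
```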
19. Storage – First Try
§ Use out-of-the-box Hadoop (Cloudera)
§ But:
- Virtualized Infrastructure
- Unknown usage patterns
- Must be cost effective
- Major Hadoop version upgrades
20. Storage – Second Try
§ Use Amazon Webservices
§ Provides flexible virtualized infrastructure
§ Cost-effective storage: S3
§ Hadoop on demand: EMR
21. Storage – Amazon S3
§ The Erlang RabbitMQ consumer simply copies the incoming data to S3
- Easy: replace the „hadoop“ command with „s3cmd“
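The swap amounts to calling a different command-line tool for the upload. The slides do not show the exact commands the consumer ran, so the sketch below is an assumption: it builds a `hadoop fs -put` call and the equivalent `s3cmd put` call in Python, with a hypothetical bucket, key and file name.

```python
import subprocess


def hadoop_put_command(local_path, hdfs_path):
    """Command line for copying a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]


def s3cmd_put_command(local_path, bucket, key):
    """Command line for copying a local file into an S3 bucket."""
    return ["s3cmd", "put", local_path, f"s3://{bucket}/{key}"]


def upload(command):
    """Run the upload and raise if the tool exits with a non-zero status."""
    subprocess.run(command, check=True)


# Hypothetical names, for illustration only.
command = s3cmd_put_command("chunk-0001.gz", "example-bucket", "incoming/chunk-0001.gz")
print(" ".join(command))
```

Only the command builder changes; the consumer logic around it stays the same.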
22. Storage – Amazon S3
§ S3 bucket receives many small, compressed log file chunks
§ Amazon provides S3DistCp, which does a distributed data copy:
- Aggregate many small files into partitioned large chunks
- Change compression
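S3DistCp decides which small files belong together through a regular expression with a capture group: files whose captured text is identical are concatenated into one output file. The sketch below imitates only that grouping step locally in Python; it does not call S3DistCp, and the file names and the pattern are hypothetical.

```python
import re
from collections import defaultdict


def group_by(paths, pattern):
    """Group paths by the text captured by the pattern's parentheses.

    In this sketch, paths that do not match are simply left out.
    """
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for path in paths:
        match = regex.search(path)
        if match:
            groups["".join(match.groups())].append(path)
    return dict(groups)


# Hypothetical chunk names: one small file every few minutes.
chunks = [
    "logs/2012-10-01T10-00-a.gz",
    "logs/2012-10-01T10-05-b.gz",
    "logs/2012-10-02T09-00-c.gz",
]
groups = group_by(chunks, r"(\d{4}-\d{2}-\d{2})T")
for day, files in groups.items():
    print(day, len(files))
```

Each group would become one large chunk, optionally rewritten with a different compression codec.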
23. Analytics
§ We want happy business users.
§ We want to answer questions.
- People want answers to questions they have. Now.
- No, they couldn't tell you that question yesterday. If they had known, they would have already asked for the answer. Yesterday.
§ We also want data-driven applications.
- Production system analysis.
- Fraud prevention.
- Recommendations.
- Social metrics for our users.
24. Analytics
§ Remember MapReduce.
- Custom Jobs.
  - Streaming: Use your favorite.
  - Java API: Cascading. Use your favorite: Java, Groovy, Clojure, Scala.
- Data Queries.
  - Hive: similar to SQL.
  - Pig: Data flow.
  - Cascalog: Datalog-like QL using Clojure and Cascading.
25. Analytics
§ Cascalog is Clojure, Clojure is Lisp
(?<- (stdout) [?person] (age ?person ?age) … (< ?age 30))
- ?<- : Query Operator
- (stdout) : Cascading Output Tap
- [?person] : Columns of the dataset generated by the query
- (age ?person ?age) : „Generator“
- (< ?age 30) : „Predicate“
§ Generators and predicates: as many as you want
§ Both can be any Clojure function
§ Clojure can call anything that is available within a JVM
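For readers who do not speak Lisp, the same query shape can be imitated in Python: a generator supplies tuples, predicates filter them, and only the requested column comes out. This is an analogy, not Cascalog. The `query` helper and the sample `AGE` data are hypothetical.

```python
# Hypothetical dataset playing the role of the `age` generator: (person, age) tuples.
AGE = [("alice", 28), ("bob", 33), ("david", 25)]


def query(generator, *predicates):
    """Return the ?person column for all tuples whose ?age passes every predicate."""
    return [
        person
        for person, age in generator
        if all(predicate(age) for predicate in predicates)
    ]


# Rough counterpart of the predicate (< ?age 30)
young = query(AGE, lambda age: age < 30)
print(young)
```

Any number of predicates can be passed, and each is an ordinary function, which mirrors the points on the slide.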
26. Analytics
§ We use Cascalog to preprocess and organize that incoming flow of log messages:
28. Analytics
§ After the Cascalog Query we have:
s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo
Hive Partitioning!
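The key layout follows the `name=value` directory convention that Hive uses for partitions. A small sketch of how such a prefix could be derived from a date follows; the function name `partition_prefix` and the website value "example" are hypothetical, and the real preprocessing is done by the Cascalog query, not by this code.

```python
from datetime import date


def partition_prefix(website, channel, day):
    """Build the key prefix for one day of one log channel."""
    return (
        f"icanslog/{website}/{channel}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


prefix = partition_prefix("example", "icans.content", date(2012, 10, 1))
print(prefix)
```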
29. Analytics
§ Now we can access the log data within Hive:
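The original table definition is not part of this transcript. As a hedged illustration of one step that is typically involved, the sketch below generates a Hive `ALTER TABLE ... ADD PARTITION` statement for one day. The table name, bucket and website are placeholders, and integer partition columns named year, month and day are an assumption.

```python
def add_partition_statement(table, location_root, year, month, day):
    """Build a Hive statement that registers one day's directory as a partition.

    Assumes integer partition columns year, month and day.
    """
    location = f"{location_root}/year={year:04d}/month={month:02d}/day={day:02d}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year={year}, month={month}, day={day}) "
        f"LOCATION '{location}'"
    )


# Placeholder names, for illustration only.
statement = add_partition_statement(
    "example_icanslog_content",
    "s3://example-bucket/icanslog/example/icans.content",
    2012, 10, 1,
)
print(statement)
```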
30. Analytics
§ Now we can run Hive queries on the [WEBSITE]_icanslog_content table!
§ But we also want to store the result to S3.
32. Analytics
§ We can now simply copy the data from S3 and import it into any local analytical tool, like:
- Excel (It must really make business people happy…)
- QlikView (Anyone can be happy with it…)
- R (If I want an answer…)
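One practical detail when importing: Hive separates fields with the control character Ctrl-A (`\x01`) by default when it writes query results as text, which desktop tools rarely expect. A minimal conversion sketch to CSV follows, assuming uncompressed text output; the column names and sample rows are made up.

```python
import csv
import io


def hive_rows(text, delimiter="\x01"):
    """Split Hive text output into rows of fields (default delimiter: Ctrl-A)."""
    return [line.split(delimiter) for line in text.splitlines() if line]


def to_csv(rows, header):
    """Render rows as CSV text that Excel, QlikView or R can read."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample of a query result: day and number of page views.
sample = "2012-10-01\x01120\n2012-10-02\x0195\n"
csv_text = to_csv(hive_rows(sample), ["day", "views"])
print(csv_text)
```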
34. Contacts.
§ Dr. Stefan Schadwinkel, stefan.schadwinkel@icans-gmbh.com, ICANS_StScha
§ Mike Lohmann, mike.lohmann@icans-gmbh.com, mikelohmann