4. BUT HOW ?
• These graphs are the end result of a process
• To get here, there are a few things you need to do and explore
5. A WORD ON NON-NATIVE APPROACHES
• Yes, you can
• map your document schema to a relational schema
• then export your data from MongoDB to a relational db
• and set up a cron job to do this every day
• then use your BI tool to map relational to “objects”
• and then Report and do Analytics
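To make the first step of that route concrete, here is a minimal sketch of mapping one document to relational rows. It assumes the player document shape shown later in the deck; the function name `to_relational_rows`, the row layouts, and the use of `name` as the join key are illustrative assumptions, not part of the original talk.

```python
def to_relational_rows(doc):
    """Split one player document into a parent row and its child rows."""
    # Parent row: one per player. "name" doubles as the join key here,
    # which is an assumption made only to keep the sketch short.
    player_row = {"name": doc["name"], "location": doc.get("location")}
    # Child rows: one per embedded game, pointing back at the player.
    game_rows = [
        {"player_name": doc["name"], "game": g["game"], "duration": g["duration"]}
        for g in doc.get("games", [])
    ]
    return player_row, game_rows


bob = {
    "name": "Bob",
    "location": "us",
    "games": [
        {"game": "WoW", "duration": 2910},
        {"game": "Tetris", "duration": 593},
    ],
}

player, games = to_relational_rows(bob)
print(player)
print(games)
```

Every embedded array becomes its own table, which is why this route tends to need a scheduled export job and a second mapping layer in the BI tool.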
6. BUT THAT WOULD BE NO FUN
• Analytics using Native Queries
• A simple process
7. PROCESS: NAIVE
• Take a sample document
• Develop query
• Put on chart
• Done !
• and a gold star from your boss !
8. PROCESS: REALITY
• Understand your schema
• multiple schemas in a single collection
• multiple collections / multiple data sources
• Iterate:
• define metric
• develop query and report on metrics
• understand and drill down or discard
• repeat
• Operationalize metrics: dashboard
• Dimensions
• Plotting
11. BUT ALSO:
• Schemas can be polymorphic
{
  "name" : "Bob",
  "location" : "us",
  "games" : [
    {"game" : "WoW", "duration" : 2910},
    {"game" : "Tetris", "duration" : 593}
  ]
}
12. SO NOW WHAT ?
• Only report on common attributes
• probably missing the most recent / interesting data
13. SO NOW WHAT ?
• Write 2 programs, one for each schema
• 2 graphs / reports
• 2 programs writing to 1 graph (basically merging instance data in 2 places)
14. SO NOW WHAT ?
• Unify Schema
• deal with absent, null values
• translate(NULL, “EU”);
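The `translate(NULL, "EU")` idea can be sketched in Python as a small normalizing step applied before any aggregation. This assumes that documents in the other schema either lack `location` or store it as null, and that "EU" is the right default, as the pseudocode suggests. The function name `unify` and the sample document for "Alice" are hypothetical.

```python
def unify(doc, default_location="EU"):
    """Return a copy of doc with the fields the reports rely on filled in."""
    unified = dict(doc)  # shallow copy; the stored document is left untouched
    if unified.get("location") is None:  # absent key or explicit null
        unified["location"] = default_location
    if unified.get("games") is None:  # treat a missing games list as empty
        unified["games"] = []
    return unified


old_style = {"name": "Alice"}  # hypothetical document without a location
new_style = {"name": "Bob", "location": "us", "games": []}

print(unify(old_style))
print(unify(new_style))
```

Keeping the default in one function means every later query sees a single shape, rather than each query repeating its own null handling.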
15. ITERATE
• Total time played and number of games people play in the US vs. the EU?
17. SIDEBAR: WRITING AGGREGATION QUERIES
• Prepare Data
• Extract relevant properties from collection documents
• Unwind a sub-collection if its documents contribute to the aggregation
• Aggregate data
• determine the key (_id) on which the aggregates should be done
• name aggregates
• Project Data
• For final results
19. PREPARE
• Only use location and games:
{ $project : {
location : 1,
games: 1
}}
• Unwind games, as properties of its documents are aggregated over:
{ $unwind : "$games" }
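To see what the unwind step produces, here is a rough pure-Python imitation of `$unwind` on the sample document. The helper name `unwind` is made up for this sketch, and it only mirrors the default behaviour (documents whose array is missing, null, or empty yield nothing).

```python
def unwind(doc, field):
    """Yield one copy of doc per element of doc[field], roughly like $unwind."""
    for element in doc.get(field) or []:
        row = dict(doc)       # copy the top-level fields
        row[field] = element  # replace the array with a single element
        yield row


bob = {
    "location": "us",
    "games": [
        {"game": "WoW", "duration": 2910},
        {"game": "Tetris", "duration": 593},
    ],
}

rows = list(unwind(bob, "games"))
for row in rows:
    print(row)
```

After this step each row carries exactly one game, so `$games.duration` refers to a single number that can be summed.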
20. AGGREGATE DATA
• Aggregate on number of games (add 1 per game)
and total duration (add duration per game)
using location as key
{ $group : {
_id : { location : "$location" },
number_games: { $sum : 1 },
total_duration: {$sum : "$games.duration"}
}}
21. PROJECT
• Only show location and aggregates, do not show _id
{ $project : {
_id : 0,
location : "$_id.location",
number_games : 1,
total_duration : 1
}}
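Put together, the stages form one pipeline. The sketch below writes it as a Python list, the form a driver such as PyMongo accepts in `collection.aggregate(pipeline)`, and adds a pure-Python equivalent so the result can be checked without a database. Note that the group key references the field as `"$location"`. The function `summarize_by_location` and the "Alice" document are assumptions added for illustration.

```python
from collections import defaultdict

# Prepare, aggregate, and project stages from the slides as one pipeline.
pipeline = [
    {"$project": {"location": 1, "games": 1}},
    {"$unwind": "$games"},
    {"$group": {
        "_id": {"location": "$location"},
        "number_games": {"$sum": 1},
        "total_duration": {"$sum": "$games.duration"},
    }},
    {"$project": {
        "_id": 0,
        "location": "$_id.location",
        "number_games": 1,
        "total_duration": 1,
    }},
]


def summarize_by_location(docs):
    """Pure-Python stand-in for the pipeline above."""
    totals = defaultdict(lambda: {"number_games": 0, "total_duration": 0})
    for doc in docs:
        for game in doc.get("games") or []:       # the $unwind step
            bucket = totals[doc.get("location")]  # the $group key
            bucket["number_games"] += 1
            bucket["total_duration"] += game["duration"]
    return [{"location": loc, **agg} for loc, agg in sorted(totals.items())]


sample = [
    {"name": "Bob", "location": "us",
     "games": [{"game": "WoW", "duration": 2910},
               {"game": "Tetris", "duration": 593}]},
    # Hypothetical second player, not from the slides.
    {"name": "Alice", "location": "eu",
     "games": [{"game": "Tetris", "duration": 1200}]},
]

summary = summarize_by_location(sample)
for row in summary:
    print(row)
```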
22. RESULT 1
• People spend a little more time playing in the US
• More games played in the EU
24. CHALLENGE 2
• Since we found that the EU and the US play for a similar amount of time and a similar number of games, the new challenge is:
• Let's see what the distribution of different games is in the 2 locations
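One way to approach this challenge is sketched below in pure Python. It takes "distribution" to mean each game's share of total play time within a location, which is an interpretation; counting sessions per game would be equally plausible. The function name `game_distribution` and the "Alice" document are hypothetical.

```python
from collections import defaultdict


def game_distribution(docs):
    """Return {location: {game: share of that location's total duration}}."""
    durations = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for g in doc.get("games") or []:
            durations[doc.get("location")][g["game"]] += g["duration"]

    shares = {}
    for location, per_game in durations.items():
        total = sum(per_game.values())
        # Guard against a location whose recorded durations are all zero.
        shares[location] = {
            game: (d / total if total else 0.0) for game, d in per_game.items()
        }
    return shares


sample = [
    {"name": "Bob", "location": "us",
     "games": [{"game": "WoW", "duration": 2910},
               {"game": "Tetris", "duration": 593}]},
    # Hypothetical second player, not from the slides.
    {"name": "Alice", "location": "eu",
     "games": [{"game": "Tetris", "duration": 1200}]},
]

shares = game_distribution(sample)
for location, per_game in shares.items():
    print(location, per_game)
```

In the aggregation pipeline, the same grouping could be expressed with a compound key such as `_id : { location : "$location", game : "$games.game" }`; turning totals into shares would still need a second pass.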
34. QUERY
• 2 aggregations happening at the same time:
• 1 by user
• 1 by location
• This query needs to be broken up into several queries
• Fairly complex
• Currently easiest to process in Ruby/Java/Python/...
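As an illustration of doing this client-side, the sketch below runs the two aggregations separately in Python and then combines them. It compares one user's time per game against the per-player average of a location. All names (`per_game_totals`, `compare_user_to_location`) and the "Alice" and "Carol" documents are hypothetical, and the numbers are not the talk's data.

```python
from collections import defaultdict


def per_game_totals(docs):
    """Total duration per game over the given documents."""
    totals = defaultdict(int)
    for doc in docs:
        for g in doc.get("games") or []:
            totals[g["game"]] += g["duration"]
    return dict(totals)


def compare_user_to_location(docs, user, location):
    """Ratio of the user's time per game to the location's per-player average."""
    user_docs = [d for d in docs if d.get("name") == user]
    peers = [d for d in docs if d.get("location") == location]
    user_totals = per_game_totals(user_docs)  # aggregation 1: by user
    peer_totals = per_game_totals(peers)      # aggregation 2: by location
    ratios = {}
    for game, user_time in user_totals.items():
        peer_avg = peer_totals.get(game, 0) / len(peers) if peers else 0
        # None signals "no peer data for this game" instead of dividing by zero.
        ratios[game] = user_time / peer_avg if peer_avg else None
    return ratios


sample = [
    {"name": "Bob", "location": "us",
     "games": [{"game": "WoW", "duration": 2910},
               {"game": "Tetris", "duration": 593}]},
    # Hypothetical European players, invented for this sketch.
    {"name": "Alice", "location": "eu",
     "games": [{"game": "WoW", "duration": 3000},
               {"game": "Tetris", "duration": 200}]},
    {"name": "Carol", "location": "eu",
     "games": [{"game": "WoW", "duration": 2820},
               {"game": "Tetris", "duration": 393}]},
]

ratios = compare_user_to_location(sample, "Bob", "eu")
print(ratios)
```

A ratio of 1.0 means the user matches the location average for that game; 2.0 means twice as much play time.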
36. RESULT 3
• Bob plays >20% WoW in comparison to the Europeans, but plays 200% more Tetris
37. A NOTE ON QUERIES
• There’s no notion of a declared schema
• The augmented schema is coded in queries
• Reuse is very hard; it happens at the query-language level
38. DIMENSIONS
• Most questions / graphs have a dimension
• Time, Geo
• Categories
• Relative: what is X's contribution to total revenue?
• You will need to be able to pass in dimensions as a predicate for your queries
• or cache result and post process client-side
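Passing a dimension as a predicate can be as simple as prepending a `$match` stage to an existing pipeline. The helper below is a minimal sketch; the name `with_dimensions` is an assumption, and it treats each dimension as a field filter (an exact value, or a query expression such as a `$gte` range for time).

```python
def with_dimensions(pipeline, **dimensions):
    """Return a new pipeline that filters on the given dimensions first."""
    if not dimensions:
        return list(pipeline)
    # Matching before the other stages keeps the later stages working on
    # only the slice of data the report asks for.
    return [{"$match": dict(dimensions)}] + list(pipeline)


base = [
    {"$unwind": "$games"},
    {"$group": {
        "_id": {"location": "$location"},
        "total_duration": {"$sum": "$games.duration"},
    }},
]

us_only = with_dimensions(base, location="us")
for stage in us_only:
    print(stage)
```

The base pipeline is left unmodified, so the same query definition can be reused across several dashboard panels with different filters.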
39. A WORD ON RENDERING GRAPHS / REPORTS
• Several libraries available for Ruby / Python / Java
• Gruff, Scruffy, StockCharts, D3, JRafael, JQuery Vizualize, MooCharts, etc.
• Also some services: John Nunemaker's work (http://get.gaug.es/)
• But Basically:
• you know how to program, right !
40. REVIEW
• Understand your schema
• multiple schemas in a single collection
• multiple collections / multiple data sources
• Iterate:
• define metric
• develop query and report on metrics
• understand and drill down or discard
• repeat
• Operationalize metrics: dashboard
• Dimensions
• Plotting
41. PUNCHLINES
• We have described a software engineering process
• but requirements will be very fluid
• When you know how to write Ruby / Java / Python etc., life is good
• If you’re a business analyst you have a problem
• better be BFF with some engineer :)
42. PLUG
• We’ve been working on a declarative analytics product
• (initially) uses Excel as its presentation layer
• Reach out to me if you’re interested
@rogerb
roger@norellan.com