These slides were designed for an Apache Hadoop + Apache Apex workshop (university program).
The audience was mainly third-year engineering students from the Computer, IT, and Electronics and Telecom disciplines.
I tried to keep it simple for beginners to understand. Some of the examples use context from India, but in general this should be a good starting point for beginners.
Advanced users and experts may not find this relevant.
2. Agenda
● What is big data?
● Data at rest Vs Data in motion
● Batch processing Vs Real-time data processing (streaming)
● Examples
● When to use: Batch? Real-time?
● Current trends
4. Definition : big data
Big data is high-volume, high-velocity and/or
high-variety information assets that demand
cost-effective, innovative forms of information
processing that enable enhanced insight,
decision making, and process automation. [1]
5. Exploding sizes of datasets
● Google
○ >100 PB of data every day [3]
● Large Hadron Collider :
○ 150M sensors delivering data 40M times per sec; nearly 600M collisions per sec
○ >500 exabytes per day [2]
○ 0.0001% of data is actually analysed
7. Data at rest Vs Data in motion
● At rest :
○ Dataset is fixed
○ a.k.a. bounded [15]
● In motion :
○ Data keeps arriving continuously
○ a.k.a. unbounded
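The bounded vs unbounded distinction can be sketched in a few lines of Python. This is only an illustration: the readings and the `sensor_stream` generator are hypothetical stand-ins, not part of any framework mentioned in these slides.

```python
import itertools
from typing import Iterator, List

# Data at rest: a fixed (bounded) dataset, fully available up front.
readings_at_rest: List[int] = [21, 23, 22, 25]


def sensor_stream() -> Iterator[int]:
    """Hypothetical unbounded source: keeps yielding readings forever."""
    value = 20
    while True:
        yield value
        value += 1


# A bounded dataset can be summarized in one pass over all of it.
average_at_rest = sum(readings_at_rest) / len(readings_at_rest)

# An unbounded stream never ends, so we can only ever look at a finite prefix.
first_three = list(itertools.islice(sensor_stream(), 3))

print(average_at_rest, first_three)
```

Calling `len()` or `sum()` on the generator itself would never finish, which is why data in motion needs a different processing style.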
8. Data at rest Vs Data in motion (continued)
● Generally, big data has velocity
○ i.e. the data keeps arriving continuously
● The difference lies in when you analyze your data [5]
○ after the event occurs ⇒ at rest
○ as the event occurs ⇒ in motion
9. Examples
● Data at rest
○ Finding stats about a group of people in a closed room
○ Analyzing sales data for the last month to make strategic decisions
● Data in motion
○ Finding stats about a group of runners in a marathon
○ e-commerce order processing
12. Batch processing : Use-cases
● Sales summary for the previous month [5]
● Training a model to detect spam emails
13. Batch processing : Characteristics
● Has access to the entire dataset
● Data splits are decided at launch time
● Capable of complex analysis (e.g. model training) [6]
● Optimized for throughput (data processed per sec)
● Example frameworks : MapReduce, Apache Spark [6]
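A minimal sketch of the batch style in plain Python, using the "sales summary for the previous month" use-case. The records in `last_month_sales` are made up for illustration; a real job would read them from storage and run on a framework such as MapReduce or Spark.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical sales records for the previous month: (product, amount).
last_month_sales: List[Tuple[str, float]] = [
    ("tea", 120.0),
    ("coffee", 250.0),
    ("tea", 80.0),
    ("sugar", 40.0),
]


def summarize(sales: List[Tuple[str, float]]) -> Dict[str, float]:
    """Total sales per product, computed over the entire dataset."""
    totals: Dict[str, float] = defaultdict(float)
    for product, amount in sales:
        totals[product] += amount
    return dict(totals)


summary = summarize(last_month_sales)

# The best seller is only known after every record has been seen.
best_seller = max(summary, key=summary.get)
print(summary, best_seller)
```

The answer is produced once, at the end, which is acceptable here because the month is already over and the dataset is fixed.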
15. Real time data processing
● a.k.a. Stream processing
● Problem statement :
○ Process an incoming stream of data
○ to give the answer for X at this moment.
16. Stream processing : Use-cases
● e-commerce order processing
● Credit card fraud detection
● Labeling a given email as spam or non-spam
18. Stream processing : Characteristics
● Results for X are based on the current data
● Computes a function on one record or a small window of records [6]
● Optimized for latency (avg. time taken per record)
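Computing on "one record or a small window" can be sketched as a sliding-window average in plain Python. The function name `windowed_average` and the sample values are assumptions made for this illustration; stream processing frameworks provide their own windowing operators.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def windowed_average(stream: Iterable[float], window_size: int) -> Iterator[float]:
    """Emit the average of the most recent `window_size` records after each record."""
    window: Deque[float] = deque(maxlen=window_size)  # old records drop out automatically
    for record in stream:
        window.append(record)
        yield sum(window) / len(window)


# A short list stands in for the incoming stream.
averages = list(windowed_average([10, 20, 30, 40], window_size=2))
print(averages)
```

Each result depends only on the current window, so an answer is available as soon as a record arrives instead of after the whole dataset has been read.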
19. Stream processing : Characteristics (continued)
● Computations must complete in near real-time
● Computes something relatively simple, e.g. using a pre-defined model to label a record.
● Example frameworks: Apache Apex, Apache Storm
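"Using a pre-defined model to label a record" might look like the sketch below. The keyword set `SPAM_KEYWORDS` is a toy stand-in for a model trained earlier in batch mode, and both function names are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Toy stand-in for a pre-trained model: a fixed set of suspicious words.
SPAM_KEYWORDS = {"lottery", "winner", "free"}


def label_email(text: str) -> str:
    """Label a single record using only the pre-defined model."""
    words = set(text.lower().split())
    return "spam" if words & SPAM_KEYWORDS else "non-spam"


def label_stream(emails: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Process emails one at a time, emitting a label as each one arrives."""
    for email in emails:
        yield email, label_email(email)


incoming = ["You are a lottery winner", "Meeting at 10 am"]
labels = [label for _, label in label_stream(incoming)]
print(labels)
```

The per-record work is cheap because the expensive part (training) has already been done elsewhere, which keeps latency low.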
24. Micro-batch
● Create batches of small size
● Process each micro-batch separately
● Example frameworks: Spark Streaming
pani puri ⇒ micro-batch (image ref [10])
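The micro-batch idea can be sketched in plain Python as follows. For simplicity this groups records by count; grouping by a time interval is another common choice. The helper name `micro_batches` is an assumption for this example, not a Spark Streaming API.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut a stream into small batches of at most `batch_size` records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # the source is exhausted
            return
        yield batch


# Each micro-batch is processed separately, here by summing it.
batch_sums = [sum(batch) for batch in micro_batches(range(1, 8), batch_size=3)]
print(batch_sums)
```

Smaller batches lower the latency per result, at the cost of more per-batch overhead.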
25. When to use : Batch Vs Streaming?
● Depends on use-case
○ Some are suitable for batch
○ Some are suitable for streaming
○ Some can be solved by either one
○ Some might need a combination of the two.
26. When to use : Batch Vs Real-time? (continued)
● Answers for the current snapshot ⇒ Real-time
○ Answers at the end ⇒ Open (either works)
● Complex calculations, multiple iterations over entire data ⇒ Batch
○ Simple computations ⇒ Open (either works)
● Low latency requirements (< 1s) ⇒ Real-time
27. When to use : Batch Vs Real-time? (continued)
● Each record can be processed independently ⇒ Open (either works)
○ Independent processing not possible ⇒ Batch
● Depends on use-case
○ Some use-cases can be solved by either one
○ Some others might need a combination of the two.
29. Can one replace the other?
● Batch processing is designed for ‘data at rest’. ‘Data in motion’ becomes stale if processed in batch mode.
● Real-time processing is designed for ‘data in motion’, but it can be used for ‘data at rest’ as well (in many cases).
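One way to picture the second point: a function written for a stream also works when a stored dataset is replayed through it. This is only a sketch; `running_total` and the order amounts are hypothetical.

```python
from typing import Iterable, Iterator


def running_total(stream: Iterable[float]) -> Iterator[float]:
    """Streaming-style computation: emit an updated total after every record."""
    total = 0.0
    for amount in stream:
        total += amount
        yield total


# Data at rest: a stored, fixed list of order amounts, replayed as a stream.
stored_orders = [100.0, 250.0, 50.0]
totals = list(running_total(stored_orders))
print(totals)
```

The last emitted value equals the batch answer `sum(stored_orders)`, while the intermediate values are what a real-time consumer would have seen along the way.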
31. Quiz : is this Batch or Real-time?
● Queue for a roller coaster ride (image ref [11])
● Queue at the petrol pump (image ref [12])
32. Quiz : is this Batch or Real-time?
● Selecting a relevant ad to show for a requested page (image ref [13])
● Courier dispatch from city A to B (image ref [14])
34. Current trends
● Many problems are difficult to split into MapReduce steps, so alternative paradigms for expressing user intent are emerging.
● More and more use-cases demand faster insight into data (near real-time)
● ‘Data in motion’ is common.
● ‘Real-time data processing’ is gaining traction.
37. References
1. Big Data | Gartner IT Glossary http://www.gartner.com/it-glossary/big-data/
2. Big Data | Wikipedia https://en.wikipedia.org/wiki/Big_data
3. Data size estimates | Follow the data https://followthedata.wordpress.com/2014/06/24/data-size-estimates/
4. Data Never Sleeps 2.0 | Domo https://www.domo.com/blog/2014/04/data-never-sleeps-2-0/
5. Data in motion vs. data at rest | Internap http://www.internap.com/2013/06/20/data-in-motion-vs-data-at-rest/
6. Difference between batch processing and stream processing | Quora https://www.quora.com/What-are-the-differences-between-batch-processing-and-stream-processing-systems/answer/Sean-Owen?srid=O9ht
7. How FAST is Credit Card Fraud Detection | FICO http://www.fico.com/en/latest-thinking/infographic/how-fast-is-credit-card-fraud-detection
8. CULINARY TERMS | panjakhada http://panjakhada.com/the-basics/
9. Crispy Chaat ... | grabhouse http://grabhouse.com/urbancocktail/11-crispy-chaat-joints-food-lovers-hyderabad/
10. Paani puri stall | cityshor http://www.cityshor.com/pune/food/street-food/camp/murali-paani-puri-stall/
11. Great Inventions: The Roller Coaster | findingdulcinea http://www.findingdulcinea.com/features/science/innovations/great-inventions/the-roller-coaster.html
12. RIL petrol pump network | economictimes http://articles.economictimes.indiatimes.com/2015-05-24/news/62583419_1_petrol-and-diesel-fuel-retailing-ril
13. Publishers | Propellerads https://propellerads.com/publishers/
14. Michael Bishop Couriers | Google plus https://plus.google.com/110684176517668223067
15. The world beyond batch: Streaming 101 http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html
16. How to Answer the Question http://www.clipartpanda.com/clipart_images/how-to-answer-the-question-46954146
17. Thank You http://www.planwallpaper.com/thank-you