5. Data processing today
Data-intensive applications
Definition:
“We call an application data-intensive if data is its primary challenge—the
quantity of data, the complexity of data, or the speed at which it is changing—as
opposed to compute-intensive, where CPU cycles are the bottleneck.”
Martin Kleppmann
6. Data processing today
Today's applications need to:
❏ Store data (databases)
❏ Cache the results of expensive operations (caches)
❏ Search data (search indexes)
❏ Handle messages asynchronously (stream processing)
❏ Periodically process accumulated data (batch processing)
8. Spark, Hadoop, MapReduce
Spark: main differences from MapReduce
❏ Spark keeps most of the working dataset in memory
❏ Implements caching mechanisms that reduce reads from disk
❏ Is often much faster than MapReduce, notably thanks to more efficient job scheduling
❏ Does not implement its own distributed storage layer, but
can run on top of Hadoop clusters (HDFS)
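The effect of caching on disk reads can be illustrated with a minimal sketch in plain Python. It does not use the Spark API: `CountingSource`, `run_jobs_without_cache` and `run_jobs_with_cache` are hypothetical names introduced here, and the "disk" is simply an object that counts how many times it is read in full.

```python
class CountingSource:
    """Hypothetical stand-in for a dataset on disk; counts full reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._records)


def run_jobs_without_cache(source, jobs):
    # Every job re-reads its input from the source,
    # as a chain of independent disk-based jobs would.
    return [job(source.read()) for job in jobs]


def run_jobs_with_cache(source, jobs):
    # The dataset is read once, kept in memory, and reused by every job.
    cached = source.read()
    return [job(cached) for job in jobs]


jobs = [len, sum, max]

uncached_source = CountingSource(range(1, 6))
uncached_results = run_jobs_without_cache(uncached_source, jobs)

cached_source = CountingSource(range(1, 6))
cached_results = run_jobs_with_cache(cached_source, jobs)

print(uncached_source.reads, cached_source.reads)
```

Both strategies produce the same results; only the number of reads differs. The trade-off is that the cached dataset has to fit in the available memory.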
30. Spark Streaming: some examples
❏ Word count
  ❏ Stateless operation: counts words within each batch
❏ Basic error count
  ❏ Stateless operation: uses a filter, contains("ERROR")
❏ Cumulative error count
  ❏ Stateful operation: counts errors since the beginning of processing
❏ Windowed error count
  ❏ Stateful operation: counts errors within a sliding time window
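The difference between stateless, cumulative and windowed operations can be sketched in plain Python. This is an illustration only, not the Spark Streaming API: the `batches` data and all function names are hypothetical, each inner list stands for the lines received during one batch interval, and the window is measured in batches rather than in seconds.

```python
from collections import Counter, deque

# Hypothetical micro-batches of log lines, one list per batch interval.
batches = [
    ["INFO start", "ERROR disk full", "INFO ok"],
    ["ERROR timeout", "ERROR disk full"],
    ["INFO ok"],
    ["ERROR timeout", "INFO ok"],
]


def word_count(batch):
    """Stateless: count words using only the current batch."""
    return Counter(word for line in batch for word in line.split())


def error_count(batch):
    """Stateless: keep lines containing "ERROR", then count them."""
    return sum(1 for line in batch if "ERROR" in line)


def cumulative_error_counts(batches):
    """Stateful: carry a running total from one batch to the next."""
    total = 0
    totals = []
    for batch in batches:
        total += error_count(batch)
        totals.append(total)
    return totals


def windowed_error_counts(batches, window_size=2):
    """Stateful: sum the error counts of the last `window_size` batches."""
    window = deque(maxlen=window_size)  # oldest batch drops out automatically
    results = []
    for batch in batches:
        window.append(error_count(batch))
        results.append(sum(window))
    return results


per_batch = [error_count(b) for b in batches]
cumulative = cumulative_error_counts(batches)
windowed = windowed_error_counts(batches, window_size=2)

print(per_batch)   # [1, 2, 0, 1]
print(cumulative)  # [1, 3, 3, 4]
print(windowed)    # [1, 3, 2, 1]
```

The stateless functions can be applied to any batch in isolation, whereas the cumulative and windowed versions must keep information between batches. In a real streaming engine, that state is what has to be stored and recovered after a failure.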