How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data from a wide variety of channels, including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and to convert all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
15. Data Lake Processes
Data sources feeding the lake: mobile apps; transactions, OLTP, OLAP; social media and web logs; machine/device and scientific data; documents and emails.
Numbered processes in the architecture:
1. Load or archive batch data
2. Replicate changed data & schemas
3. Stream real-time data
4. Mask sensitive data
5. Access customer "golden record" (MDM)
6. Refine & curate data
7. Move results to EDW
8. Explore & validate data
9. Govern & enrich with metadata
10. Correlate real-time events with historical patterns & trends
11. Subscribe to datasets (Data Integration Hub)
Supporting components: Data Integration Hub, Data Virtualization, MDM, Enterprise Data Warehouse, and Visualization & Analytics.
17. Use-Case: CDR Processing
• Each job picks up a number of files containing text CDRs (Call Detail Records)
• First task is to merge partial call records
• Some records may be partial – e.g., multiple records for a single call due to a dropped line or switching cell towers
• Partial records need to be merged, and the total call time needs to be added to the duration of the merged record
• Partial records for a single call may reside in multiple files or be included in different jobs
• Incomplete partial records need to be reprocessed by subsequent jobs
• Second task is to sort all processed CDRs by calling number
18. Input CDR File Example
• Three numbers uniquely identify a call
• Partial calls start with 1 and end with 0
• Some partial records are incomplete
• Processed completed records are sorted by caller
19. Output CDR Files
• Completed Calls – partial records are merged into a single completed record, and their duration times are added to the merged record
• Partial Calls – incomplete partial records that will be reprocessed by a later job
20. Logical Design
• Separate partial records from completed records, producing a completed-records-only stream and a partial-records-only stream
• From the partial records, separate incomplete and complete partial records
• Select incomplete partial records for reprocessing
• Aggregate all completed and partial-completed records
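The merge-and-split logic described above can be sketched in plain Python. The field names (`key`, `caller`, `flag`, `duration`) and the exact flag semantics are assumptions for illustration only; the slides show just that three numbers identify a call and that partial calls start with 1 and end with 0:

```python
from collections import defaultdict

def process_cdrs(records):
    """Separate completed from partial CDRs, merge complete partials,
    and return (completed, leftover_partials).

    Assumed record fields (hypothetical, not from the deck):
      key      - tuple of the three numbers identifying a call
      caller   - calling number
      flag     - 1 on the opening partial, 0 on the closing partial,
                 None for a record that is already complete
      duration - call time covered by this record
    """
    by_call = defaultdict(list)
    for rec in records:
        by_call[rec["key"]].append(rec)

    completed, leftover = [], []
    for parts in by_call.values():
        flags = [p["flag"] for p in parts]
        if None in flags:                 # already a complete record
            completed.extend(parts)
        elif 1 in flags and 0 in flags:   # opening and closing partials present
            merged = dict(parts[0])
            # Total call time is the sum of the partial durations.
            merged["duration"] = sum(p["duration"] for p in parts)
            completed.append(merged)
        else:                             # incomplete: reprocess in a later job
            leftover.extend(parts)

    # Processed CDRs are sorted by calling number.
    completed.sort(key=lambda r: r["caller"])
    return completed, leftover
```

A real job would run this per input batch and feed the leftover partials back into the next job's input, mirroring the reprocessing step in the logical design.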
30. CDR Pipeline
• Scenario – Filter records by Date, City and Province; aggregate and summarize records by a composite Key
• Pipeline stages: Read Files → Filter by Province ID → Filter by Collection Date → City Code Lookup → Sort records by Key → Summarize by Key Group → Write report
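The pipeline stages above can be sketched in Python with generator expressions for the filter steps. The field names (`province_id`, `collection_date`, `city_code`, `duration`) and the composite-key choice are assumptions, since the slide does not show the record schema:

```python
from itertools import groupby

def cdr_pipeline(records, province_id, date_range, city_names):
    """Filter, enrich, sort, and summarize CDRs by a composite key."""
    # Filter by Province ID
    rows = (r for r in records if r["province_id"] == province_id)
    # Filter by Collection Date (inclusive range)
    lo, hi = date_range
    rows = (r for r in rows if lo <= r["collection_date"] <= hi)
    # City Code Lookup: enrich each record with the city name
    rows = (dict(r, city=city_names.get(r["city_code"], "unknown"))
            for r in rows)
    # Sort records by the composite Key, then summarize by Key group
    key = lambda r: (r["city"], r["collection_date"])
    report = []
    for k, group in groupby(sorted(rows, key=key), key=key):
        g = list(group)
        report.append({"key": k, "calls": len(g),
                       "total_duration": sum(r["duration"] for r in g)})
    return report
```

In a Hadoop job the sort-then-group step would fall out of the shuffle phase; here `sorted` plus `itertools.groupby` plays that role in miniature.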
32. Adding Transactional Source
• Scenario – Report website use (Facebook, Twitter, etc.) by Age and by Postal Code
• Pipeline stages: Read WAP log records → Get MSISDN and URL fields → Lookup Age and Postal Code by MSISDN → Count URL frequency → Calculate percentages
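A minimal sketch of this join-and-count pipeline, assuming a hypothetical subscriber lookup table keyed by MSISDN (the deck does not show the actual log or CRM schema):

```python
from collections import Counter

def website_use_report(wap_log, subscribers):
    """Count URL hits per (age, postal_code, url) group and express
    each count as a percentage of all matched traffic."""
    counts = Counter()
    for entry in wap_log:
        # Get MSISDN and URL fields from the WAP log record
        msisdn, url = entry["msisdn"], entry["url"]
        # Lookup Age and Postal Code by MSISDN (transactional source)
        sub = subscribers.get(msisdn)
        if sub is None:
            continue  # skip log entries with no matching subscriber
        counts[(sub["age"], sub["postal_code"], url)] += 1

    total = sum(counts.values())
    return {k: {"hits": n, "pct": 100.0 * n / total}
            for k, n in counts.items()}
```

At scale the `subscribers` dict would be a lookup against CRM/EDW data (e.g., a map-side join), but the shape of the computation is the same.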
34. Result
• Easily combine big data sources with transactional data
• Example – Report website use (Facebook, Twitter, etc.) by Age and by Region
• Log files in HDFS supply the usage data; a lookup of Age and Region by MSISDN joins in transactional data from CRM and EDW systems