From StampedeCon 2014: Ronald Indeck (VelociData), "Enabling Key Business Advantage from Big Data through Advanced Ingest Processing."
All too often we see critical data dumped into a "Data Lake," causing the data waters to stagnate and become a "Data Swamp." We have found that many data transformation, quality, and security processes can be addressed a priori, on ingest, to improve the quality and accessibility of the data. Data can still be stored in raw form if desired, but processing on ingest can unlock operational effectiveness and competitive advantage by integrating fresh and historical data, enabling the full potential of the data. We will discuss the underpinnings of stream processing engines, review several relevant business use cases, and discuss future applications.
1. Enabling Key Business Advantage from Big Data through Advanced Ingest Processing
Ronald S. Indeck, PhD
President and Founder
VelociData, Inc.
Solving the Need for Speed in Big DataOps
2. Today's Discussion
• Motivations for Advanced Processing
• Total Data Challenges
• Economical Parallelism for IT is Arriving
• Heterogeneous System Architectures (HSA)
• HSA Implementation and Business Benchmarks
• Questions
4. The Urgency for Gaining Answers in Seconds
Companies that embrace analytics accelerate performance. "Value Integrators" achieve higher business performance:
‒ 20 times the EBITDA growth
‒ 50% more revenue growth
• "Large-scale data gathering and analytics are quickly becoming a new frontier of competitive differentiation" – HBR
• The challenge for IT is to economically provide real-time, quality data to support business analytics and meet time-bound service-level requirements when data are doubling every 12 months
Analytics is creating a competitive advantage
5. Recognizing "Total Data" Challenges
• Bloor: databases are more than adequate for the use cases they are designed to support
• Consider Big Data AND Relational, not OR … think "Total Data"
• The critical unsolved challenge is breaking Total Data flow bottlenecks
• Total Data challenges:
  • Data volumes exploding
  • Data velocity and variety growing
  • Data must quickly move between disparate systems
  • Processing high volumes on mainframes is expensive
  • No spare resources for critical encryption / masking
  • Improving or measuring data quality is challenging
6. Conventional Approaches
• Add more cores and memory to the existing platform
• Push processing into MPP (Teradata, Netezza, …)
• Change the infrastructure (Oracle Exadata, …)
• Use distributed platforms (Hadoop, …)
These require new skills, time, capital, management, support, risk … and fail to truly solve the Total Data flow problem
7. Parallelism in IT Processing is Compelling
• Amdahl's Law caps the speedup available from parallel hardware (see the formula below)
• High Performance Computing history:
  • Systems were expensive
  • Unique tools and training were required
  • Scaling performance is often sub-linear
  • Issues with timing and thread synchronization
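As a reminder (a standard result, not from the slide): if a fraction p of a workload parallelizes across n processors, the best-case speedup is

$$ S(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p} $$

Even at p = 0.95, no amount of hardware delivers more than 20x, because the serial residue (I/O, synchronization) dominates; this is one reason HPC scaling is so often sub-linear.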
HPC has struggled for 40 years to deliver widespread accessibility, mostly due to cost and to poor abstractions, development tools, and design environments.
If we could just deliver accessibility at an affordable cost …
• Hardware is now becoming inexpensive
• Application development improvements are still needed to enable productivity
• Abstract the parallelism away by making streaming the programming paradigm
8. Complementary Approach: Heterogeneous System Architecture
• Leverage a variety of compute resources
  • Not just parallel threads on identical resources
  • The right resources at the right times
  • Functional elements use appropriate processing components where needed
• Accommodate stream processing (a sketch of the idea follows)
  • Source → processing → target
  • A streaming data model enables pipelining and data-flow acceleration
• Embrace fine-grained pipeline / functional parallelism
  • Especially data / direct parallelism
  • Separate latency and throughput
• Engineered system
  • Manage thread, memory, and resource timing and contention
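As a toy software illustration of that streaming model (a minimal sketch only; VelociData's appliance does this in firmware and hardware, and every name below is hypothetical), stages run concurrently with bounded queues between them, so pipeline parallelism overlaps the source, processing, and target work:

```python
import threading
import queue

SENTINEL = object()  # marks end-of-stream

def stage(fn, inq, outq):
    """One pipeline stage in its own thread: apply fn to each record."""
    while (rec := inq.get()) is not SENTINEL:
        outq.put(fn(rec))
    outq.put(SENTINEL)  # propagate shutdown downstream

def run_pipeline(source, fns):
    """Wire stages together with bounded queues: source -> processing -> target."""
    qs = [queue.Queue(maxsize=1024) for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, qs[i], qs[i + 1]), daemon=True)
               for i, fn in enumerate(fns)]

    def feed():  # the feeder runs in its own thread so a large source cannot deadlock
        for rec in source:
            qs[0].put(rec)
        qs[0].put(SENTINEL)

    threads.append(threading.Thread(target=feed, daemon=True))
    for t in threads:
        t.start()
    while (rec := qs[-1].get()) is not SENTINEL:
        yield rec

# Usage: both stages overlap in time while records flow through.
records = ({"id": i} for i in range(5))
for out in run_pipeline(records, [lambda r: {**r, "clean": True},
                                  lambda r: {**r, "key": r["id"] * 7}]):
    print(out)
```

With a steady input stream, throughput approaches that of the slowest stage while latency is the sum of the stages, which is exactly the latency/throughput separation the slide calls out.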
9. Heterogeneous System Architecture
• Standard CPUs: general purpose, "not bad at everything"
  – Good branch prediction, fast access to large memory
• Graphics Boards (GPUs): thousands of cores performing very specific tasks
  – Excellent at matrix and floating-point work
• FPGA Coprocessors: fully customizable, with extreme opportunities for parallelism
  – Excel at bit manipulation for regex, cryptography, searching, …
10. Example: Risk Modeling Application
• Compute "value at risk" for a portfolio of 1,024 stocks
• Evaluate using Monte Carlo simulation: a Brownian-motion random walk
• Execute 1 million trials and aggregate the results; 1 trial equals 1,024 random walks
• Double-precision computation (see the sketch below)
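A minimal NumPy sketch of that computation's shape (illustrative parameters and trial counts only, not VelociData's tuned code; geometric Brownian motion is one common choice of random walk): each trial drives 1,024 walks, and value at risk is read off a quantile of the simulated P&L:

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_trials, n_steps = 1024, 1_000, 64     # the slide's run uses 1 million trials
prices = rng.uniform(20.0, 200.0, n_stocks)       # hypothetical starting prices
mu, sigma, dt = 0.05, 0.20, 1.0 / 252             # illustrative drift / volatility per step

pnl = np.empty(n_trials)                          # float64: double precision, as on the slide
for t in range(n_trials):
    # One trial = 1,024 Brownian-motion random walks, advanced n_steps each
    shocks = rng.standard_normal((n_steps, n_stocks))
    paths = prices * np.exp(np.cumsum((mu - 0.5 * sigma**2) * dt
                                      + sigma * np.sqrt(dt) * shocks, axis=0))
    pnl[t] = paths[-1].sum() - prices.sum()       # portfolio P&L for this trial

var_99 = -np.quantile(pnl, 0.01)                  # 99% VaR: loss at the 1st percentile
print(f"99% VaR over {n_trials:,} trials: {var_99:,.0f}")
```

Each trial is independent, which is why this workload maps so cleanly onto data-parallel hardware.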
11. Example: Risk Modeling Performance Results
• Baseline (CPU-only): 450 thousand walks/second → 37 minutes to execute 1 billion walks
• FPGA + GPU + CPU: 140 million walks/second → 6 seconds for 1 billion walks
• Speedup of 370x
• Other financial Monte Carlo simulations show similar results
*First use of GPU, FPGA, and CPU in one application
[Diagram: application stages 1, 2, and 3 mapped onto an FPGA, a graphics engine, and a chip multiprocessor]
12. Stream Processing as an HSA Appliance
• Bundles software, firmware, and hardware into an appliance
• Delivers the right compute resource (CPU, GPU, or FPGA) to the right process at the right time
• Uses other system resources effectively
• High-level abstraction: no need to code, re-train, or acquire new skill sets
• Promotes stream processing for real-time action
  • Sources → processing → targets
  • A streaming data model enables pipelining for data-flow acceleration
13. Example: VelociData Solution Palette
| Suite | Solution | Examples | Conventional (records/s) | VelociData (records/s) |
|---|---|---|---|---|
| Data Transformation | Lookup and Replace | Data enrichment by populating fields from a master file, dictionary translations, etc. (e.g., CP → Cardiopulmonologist) | 3,000–6,000 | 600,000 |
| Data Transformation | Type Conversions | XML → fixed; binary → char; date/time formats | 1,000–2,000 | 800,000 |
| Data Transformation | Format Conversions | Rearrange, add, drop, merge, split, and resize fields to change layouts | 1,000–10,000 | 650,000 |
| Data Transformation | Key Generation | Hash multiple field values into a unique key (e.g., SHA-2) | 3,000–20,000 | > 1,000,000 |
| Data Transformation | Data Masking | Obfuscate data for non-production uses: persistent or dynamic; format-preserving; AES-256 | 500–10,000 | > 1,000,000 |
| Data Quality | USPS Address Processing | Standardization, verification, and cleansing (CASS certification in process) | 600–2,000 | 400,000 |
| Data Quality | Domain Set Validation | Validate a value against a list of acceptable values (e.g., all product codes at a retailer; all countries in the world) | 1,000–3,000 | 750,000 |
| Data Quality | Field Content Validation | Validate based on patterns such as emails, dates, and phone numbers | 1,000–3,000 | > 1,000,000 |
| Data Quality | Field Content Validation | Data type validation and bounds checking | 3,000–6,000 | > 1,000,000 |
| Data Platform Conversion | Mainframe Data Conversion | Copybook parsing & data layout discovery; EBCDIC, COMP, COMP-3, … → ASCII, integer, float, … | 200–800 | > 200,000 |
| Data Sort | Accelerated Data Sort | Sort data using complex sort keys from multiple fields within records | 7,000–20,000 | 1,000,000 |
Results are system dependent; the figures are intended as order-of-magnitude comparisons. A small software sketch of the Key Generation row follows.
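As a single-core software point of reference for the Key Generation row (a minimal sketch; the record and field names are made up), hashing multiple field values into one fixed-width key with SHA-2 looks like this:

```python
import hashlib

def make_key(record, fields, sep="\x1f"):
    """Hash selected field values into one surrogate key (SHA-256 is a SHA-2 variant)."""
    # The unit-separator byte prevents ("ab", "c") and ("a", "bc") from colliding.
    material = sep.join(str(record[f]) for f in fields)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

rec = {"last_name": "Smith", "dob": "1970-01-01", "zip": "63130"}
print(make_key(rec, ["last_name", "dob", "zip"]))  # same inputs always yield the same key
```

The operation itself is trivial per record; the table's point is sustaining it at hundreds of thousands of records per second inside a full transformation flow.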
14. Example of Common ETL Bottlenecks
[Diagram: sources (CSV, Mainframe, XML, RDBMS, Social Media, Sensor, Hadoop) feed an ETL server running Tasks #1–#8 against a staging DB in an Extract → Transform → Load flow; several tasks are marked as candidates for acceleration. Targets: Hadoop, ETL server, data warehouse, database appliances, BI tools, cloud.]
15. Example ETL Processes Offloaded
[Diagram: the same flow with Tasks #1–#5 offloaded, leaving Tasks #6–#8 on the ETL server. Callouts: keep existing input interfaces; remove bottlenecks; reduce ETL server workload; faster total processing time.]
16. Example Mainframe-to-Hadoop Workflow
• Simple, configuration-driven workflow (see the sketch below)
• The sample shows Mainframe → HDFS
• Data are validated, cleansed, reformatted, enriched, …, along the way
• Enables landing analytics-ready data as fast as it can move across the wire
• The workflow can also run in reverse to return processed data to the mainframe

Mainframe Input → Validation → Key Generation → Formatter → Lookup → Address Standardization → CSV Out
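A minimal, hypothetical sketch of what "configuration-driven" can mean in software (the stage names mirror the slide; nothing here is VelociData's actual API): the workflow is a declared list of stage names that an engine composes over the record stream:

```python
BRANCHES = {"US": "domestic", "GB": "intl"}  # made-up lookup table

# Hypothetical stage library: each stage is record -> record (or None to drop it).
STAGES = {
    "validate":    lambda r: r if r.get("acct") else None,            # drop records missing a key field
    "gen_key":     lambda r: {**r, "key": hash(r["acct"]) & 0xFFFF},  # toy surrogate key
    "format":      lambda r: {**r, "acct": r["acct"].strip().upper()},
    "lookup":      lambda r: {**r, "branch": BRANCHES.get(r["acct"][:2], "??")},
    "std_address": lambda r: {**r, "addr": " ".join(r["addr"].split())},
}

# The "configuration": reordering this list reorders the workflow, no code changes.
WORKFLOW = ["validate", "gen_key", "format", "lookup", "std_address"]

def run(records, workflow):
    for rec in records:
        for name in workflow:
            rec = STAGES[name](rec)
            if rec is None:
                break               # validation dropped the record
        else:
            yield rec               # survived every stage: emit as CSV-ready dict

for out in run([{"acct": "us1234", "addr": "  1 Brookings   Dr "}], WORKFLOW):
    print(out)
```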
17. Wire-rate Platform Integration
Enable fast data access between systems:
• MPP platforms (e.g., Teradata): format and improve data for ready insertion into data analytics architectures; VelociData enables real-time data access by Teradata for operational analytics
• ETL server: preprocess data for fast movement into and out of data integration tools
• Mainframe: conversion into and out of EBCDIC and packed-decimal formats (see the sketch below)
• Hadoop: convert data to ASCII and improve quality in flight; VelociData feeds Hadoop pre-processed, quality data for real-time BI efforts
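For the mainframe leg, a minimal sketch of the two conversions named above, using Python's built-in cp037 codec (a common US EBCDIC code page) and a hand-rolled decoder for packed decimal (COMP-3); the sample bytes are made up:

```python
# EBCDIC -> ASCII: cp037 is a common US EBCDIC variant
ebcdic_bytes = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])   # "HELLO" in EBCDIC cp037
print(ebcdic_bytes.decode("cp037"))                     # -> HELLO

def unpack_comp3(data: bytes, scale: int = 0):
    """Decode IBM packed decimal (COMP-3): two digits per byte, sign in the last nibble."""
    digits = []
    for b in data[:-1]:
        digits.append(f"{b >> 4}{b & 0x0F}")
    digits.append(str(data[-1] >> 4))        # last byte holds one digit plus the sign nibble
    value = int("".join(digits))
    if (data[-1] & 0x0F) == 0x0D:            # 0xD = negative; 0xC / 0xF = positive
        value = -value
    return value / (10 ** scale) if scale else value

print(unpack_comp3(bytes([0x12, 0x34, 0x5C]), scale=2))  # -> 123.45
```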
18. Enabling Three Layers of Data Access
Sources (Sensors, Weblogs, Transactions, Mainframe, Hadoop, Social Media, RDBMS, …) feed three layers of access:
• Hadoop: VelociData delivers pre-processed, quality data to keep "the lake" clean
• Immediate analytics: VelociData enables real-time data access for analytics and visualization
• Databases and warehouses: VelociData feeds them pre-analytic, aggregated data for operational analytics
Wire-rate transformations and convergence of fresh and historical data
19. Accessing Real-time and Historical Data
• Real-time analysis for competitive advantage
  • Enabling the speed of business to match business opportunities
• Integrating historical data for operational excellence
  • Informing traditional BI with real-time inputs

[Diagram labels: Conventional Batch-oriented BI; Real-time Operational Analytics; Iterative Modeling; Business Excellence]
20. Stream Processing AND Hadoop
Leveraging stream processing with batch-oriented Hadoop:
• Access to more data for analytics
• Process data on ingest (also land the raw data, if desired):
  • Transformation
  • Cleansing
  • Security
• Never read a COBOL copybook again
• Stream sort for integrating data, aggregation, and dedupe (a sketch follows)
• …
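For the stream-sort bullet, a minimal sketch of the standard software technique (not VelociData's engine): merge already-sorted runs with heapq.merge, which holds only one record per run in memory, then dedupe consecutive duplicates on the fly:

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Two already-sorted streams (e.g., fresh and historical data), keyed on (acct, date)
fresh      = [("A1", "2014-05-02", 7), ("B2", "2014-05-02", 3)]
historical = [("A1", "2014-05-01", 5), ("A1", "2014-05-02", 7), ("B2", "2014-05-01", 2)]

key = itemgetter(0, 1)                       # composite sort key from multiple fields
merged = heapq.merge(fresh, historical, key=key)

# groupby collapses consecutive equal-key records, which is all dedupe needs post-merge
for _, group in groupby(merged, key=key):
    print(next(group))                       # keep the first record per key
```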
21. Examples of Data Challenges Being Solved
• A pharmaceutical discovery query is reduced from 8 days to 20 minutes
• A retailer now integrates full customer data from in-store, online, and mobile sources in real time (processing 50,000 records/s, up from 100/s)
• A property & casualty insurer shortens five-fold a daily task of processing 540 million records, enabling more accurate real-time quoting
• A credit card company reduces mainframe costs and improves analytics performance by integrating historical and fresh data into Hadoop at line rates
• A financial processing network masks 5 million fields/s of production data to sell opportunity information to retailers
• To enable better customer support, a health benefits provider shortens a data integration process from 16 hours to 45 seconds
• Billions of records with multi-field keys are sorted at nearly a million records/s for analytics and data quality
• USPS address standardization runs at 10 billion records/hour for data cleansing on ingest