What priorities are driving big data implementations? What challenges are companies running into? What are big data implementations being used for? Are people seeing the benefits they expected?
Each year, we send out a survey to find out what is on the minds of people who are either piloting a Hadoop or Spark program or deep in the thick of one. Almost 200 professionals in a variety of roles — data scientists, CTOs, developers, architects, and IT managers — weighed in on what matters to them in the big data world. View this webinar on-demand to see what we learned.
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
2018 Big Data Trends: Liberate, Integrate, and Trust Your Data
1. 2018 Big Data Trends:
Liberate, Integrate, and Trust Your Data
Paige Roberts, Big Data Product Marketing Manager
2. Today’s Speaker
Paige Roberts, Product Marketing Manager
• DMX/DMX-h
• DataFunnel™
• DMX Change Data Capture
3. Agenda
Who is Syncsort and Why Did We Do This Survey
Big Picture on Big Data
Who Participated in the Big Data Trends Survey
5 Big Data Trends and How Syncsort Addresses Them
– 1. More Enterprise Data Flows Into the Data Lake
– 2. Data Quality Moves to Center Stage
– 3. Data Governance Expands
– 4. Data Lakes Stay Fresher
– 5. Big Data: Stronger than Ever
How Syncsort Addresses These Trends
Questions
5. Syncsort: Trusted Industry Leadership
• 500+ experienced and talented data professionals
• >7,000 customers
• Founded in 1968 – 50 years of market leadership and award-winning customer support
• 84 of the Fortune 100 are customers
• 3x revenue growth in the last 12 months
The global leader in Big Iron to Big Data
6. Use Cases & Strategic Partnerships
Big Iron to Big Data: a fast-growing market segment composed of solutions that optimize traditional data systems and deliver mission-critical data from these systems to next-generation analytic environments.
• Data Infrastructure Optimization: Mainframe Optimization, Application Modernization, EDW Optimization, Cross-Platform Capacity Management
• Data Availability: High Availability & Disaster Recovery, Mission-Critical Migration, Cross-Platform Data Sharing, IBM i Data Security & Audit
• Data Integration: Mainframe Access & Integration for Machine Data, Mainframe Access & Integration for App Data, High-Performance ETL
• Data Quality: Data Governance, Customer 360, Big Data Quality & Integration, Data Enrichment & Validation
8. Advantages of the Modern Big Data Architecture
9. What do customers want to use their Hadoop clusters for?
1. ETL
2. Analytics*
3. Data Blending
4. Active Archive
5. EDW / Mainframe Optimization
10. Implementation Challenges
1. Data Quality: Assessing and improving the quality of data as it enters and/or sits in the data lake.
2. Skills/Staff: Need to learn a new set of skills; Hadoop programmers are difficult to find and/or expensive.
3. Data Governance: Including the data lake in governance initiatives and meeting regulatory compliance.
4. Rapid Change: Frameworks and tools evolve fast, and it's difficult to keep up with the latest tech.
5. Fresh Data (CDC): Difficult to keep the data lake up-to-date with changes made on other platforms.
6. Mainframe: Difficult to move mainframe data in and out of Hadoop/Spark.
7. Data Movement: Difficult to move data in and out of Hadoop/Spark.
[Chart: % of respondents who rated each item a top challenge (1 or 2) — data quality, skills, governance, rapid change, CDC, mainframe, data movement, cost, connectivity — on a 0–45% scale.]
12. Who Participated in the Big Data Trends Survey?
Main Industries Represented:
1. Financial Services
2. Healthcare
3. Information Services
4. Government
5. Retail
6. Insurance
Main Roles Represented:
1. Data Architects
2. Developers
3. IT Managers
4. Data Scientists
5. Variety of other roles
15. Implementation Challenges
(Recap of the survey's top implementation challenges and chart shown on slide 10.)
16. What data do people need to get into their Hadoop clusters?
1. Relational Databases
2. Enterprise Data Warehouses
3. NoSQL Databases and Third-Party Data
4. Cloud Repositories
5. Mainframe Data
6. Web / Mobile / Social Media Data
7. AIX Power Systems and IBM i Data
8. Machine / Sensor Data
[Chart: percentage of respondents needing each source type in their Hadoop cluster; relational databases led at 69%.]
17. How Valuable is Mainframe and IBM i Data in a Data Lake?
Over 97% of respondents with mainframes believe it's valuable to access and integrate that data in the data lake.
Over 90% of organizations that have IBM i say it is valuable to integrate that data with Hadoop.
18. Populating the Data Lake with Progressive
The Progressive Group of Insurance Companies lives up to its name by staying one step ahead of the insurance industry, innovating with the latest technology to make it easy to understand, buy and use auto insurance. Progressive opened the first drive-in claims office in 1937, pioneered online auto insurance policy sales in 1997, and customizes premiums based on customers' actual driving patterns. Progressive has been recognized as a top business technology innovator by InformationWeek 17 years in a row.
Challenge:
• Easily access and integrate operational data, such as Claims Liability, Policy, Customer and Incident data, for advanced analytics.
• Fill the Hortonworks data lake with 500+ tables from mainframe DB2, Oracle and SQL Server for cost-effective storage and analytics.
• Track day-to-day changes in the data.
Solution:
• DMX DataFunnel easily and quickly ingested all database tables with the click of a button.
• DMX-h, used on the Hortonworks Data Platform cluster, determines daily changes from both full and incremental data files.
Benefit:
• Simplicity: a single tool to ingest, detect changes and populate the data lake.
• Faster development and implementation: DataFunnel ingested data much faster than open source tools.
• Skills: developers don't need in-depth knowledge of Hadoop.
Business Value:
• Insight: better analytics with readily accessible operational data.
• Compliance: ability to build audit trails and keep the EDW current.
• Agility: reclaimed development time by automating, optimizing and future-proofing development.
• Costs: lower archival costs.
20. Implementation Challenges
(Recap of the survey's top implementation challenges and chart shown on slide 10.)
21. Big Data deemed untrustworthy by business managers/leaders
• Only 33% of senior execs have a high level of trust in the accuracy of their Big Data analytics. (KPMG, 2016)
• 85% of global execs say major investments are needed to update their existing data platforms, including data cleansing and consolidation. (Bain, 2015)
• 59% of global execs do not believe their company has the capabilities to generate meaningful business insights from their data. (Bain, 2015)
22. Three Insights on Data Quality in Big Data Architectures
The greater the diversity of data, the greater the need for data quality processes.
– Over 60% of respondents said storing enterprise-wide data was critical to supporting their business.
– Respondents cited an average of four sources each.
– Respondents who identified five or more sources were 4x as likely to name data quality as a critical factor in a successful data lake implementation.
Financial services and insurance are the industries most focused on data quality and governance.
– These are highly regulated industries with a high cost of non-compliance.
– 60% of respondents in these industries named data quality as most critical, compared to 40% in other industries.
Not everyone is making the connection between quality and business benefits.
– 70% of respondents who did not include data quality as a top priority put advanced/predictive analytics as their top use case.
– Executives' increasing reliance on analytics insights should go hand in hand with trusted, high-quality data.
23. Washing Out Money Laundering at a Large UK-Based Bank
• Selected BAE Systems’
NetReveal as new Anti-Money
Laundering (AML) solution,
operating on a Hadoop data
lake.
• Hadoop functionality was key
to meeting next-gen AML
transaction monitoring and
FCA compliance demands
using an efficient, inexpensive
distributed architecture.
• Needed a new data quality
solution for party/entity
matching in Hadoop to
support its new Anti-Money
Laundering solution.
• Trillium Quality for Big Data
was selected after a
competitive RFP process as
solution of choice for
party/entity matching in the
data lake.
• Proven speed and
performance in Hadoop
using integrated DMX-h
Intelligent eXecution
functionality.
• Ability to leverage existing
Trillium Software System
skills; i.e, visual creation of
data quality jobs.
• Proven domain expertise.
TSS is in active use elsewhere
in the company. The Trillium
team also showed its domain
expertise, such as proper
SWIFT processing.
• Native processing of data
quality jobs within Hadoop
“financial crimes database”
at high performance and
massive scale.
• Will support AML
compliance for many years
to come.
Business Challenge Solution Benefit Business Value
A UK-based bank serving over 30 million customers, providing current (checking) accounts,
savings, personal loans, credit cards and mortgages. Employing over 75,000 people, this bank funds
a large percentage of UK new-build properties and lends to many first-time UK home buyers.
25. Implementation Challenges
(Recap of the survey's top implementation challenges and chart shown on slide 10.)
26. Data Quality & Data Governance Work Together
DATA QUALITY: The processes that help ensure data is understood, corrected and monitored to ensure TRUST and COMPLIANCE.
DATA GOVERNANCE: The collection of practices and processes which help ensure the formal management of data assets within an organization.
(Data Governance vs Data Quality: Managing Data-Driven Solutions. www.dataversity.com)
[Diagram: overlapping activities including data availability, data compliance, defining key data elements, assigning data stewards, data consistency, data cleansing, enrichment, monitoring, standardization, defining policies, consistent analytics/metrics/reporting, parsing, matching, discovery & profiling, and data lineage.]
27. Data Quality Processing for Compliance
Cleanse data while improving contextual understanding:
• Parse data values from unstructured fields into useful, usable new attributes.
• Verify and enrich global postal addresses.
• Standardize values for matching and linking.
• Enrich data with external, third-party sources to create comprehensive, unified records.
• Link records spanning multiple sources of personal data related to the same customer.
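To make those steps concrete, here is a minimal, hypothetical Python sketch of the parse → standardize → link flow the slide describes. It is illustrative only and does not use Trillium's or DMX-h's actual APIs; the field layout, title-stripping rule and match key are all assumptions.

```python
import re

# Hypothetical raw records with an unstructured "contact" field (assumed layout: name, street, city).
records = [
    {"id": 1, "contact": "DR. Jane  Smith, 10 downing st , LONDON"},
    {"id": 2, "contact": "Jane Smith, 10 Downing Street, London"},
]

def parse(contact):
    """Parse an unstructured contact string into name / street / city attributes."""
    name, street, city = [part.strip() for part in contact.split(",")]
    return {"name": name, "street": street, "city": city}

def standardize(fields):
    """Standardize values so matching and linking work across sources."""
    name = re.sub(r"^(dr|mr|mrs|ms)\.?\s+", "", fields["name"].lower())
    name = re.sub(r"\s+", " ", name)
    street = fields["street"].lower().replace("street", "st")
    return {"name": name, "street": street, "city": fields["city"].lower()}

def match_key(fields):
    """Build a simple key used to link records that refer to the same person."""
    return (fields["name"], fields["street"], fields["city"])

linked = {}
for rec in records:
    key = match_key(standardize(parse(rec["contact"])))
    linked.setdefault(key, []).append(rec["id"])

print(linked)  # both ids collapse onto one linked entity
```

In a real pipeline the standardization rules and match keys come from address reference data and tuned matching logic rather than hand-written regexes, but the shape of the processing is the same.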
29. Implementation Challenges
(Recap of the survey's top implementation challenges and chart shown on slide 10.)
30. Keeping the Data Lake Fresh: Even Harder Than You Think
Keeping data in the data lake fresh is difficult, especially when the source is mainframe data.
• Transactional sources change with each transaction – often millions per day.
• Each source has its own way of tracking data changes.
• Some Hadoop targets, such as Hive, don't even support fast updating.
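One common workaround when the Hadoop target can't be updated in place is to rebuild the current view by reconciling the existing snapshot with incoming change records. Below is a minimal PySpark sketch of that pattern; the table names, landing path, key, timestamp and op columns are assumptions for illustration, not Syncsort's implementation.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("reconcile_changes").getOrCreate()

# Assumed shared schema: customer_id, change_ts, op ("I"/"U"/"D"), plus data attributes.
base = spark.table("lake.customers_base")                # yesterday's snapshot (hypothetical table)
deltas = spark.read.parquet("/landing/customers_cdc/")   # today's change records (hypothetical path)

# Union the old snapshot with the new changes, then keep only the newest version of each key.
latest = (
    base.unionByName(deltas)
        .withColumn("rn", F.row_number().over(
            Window.partitionBy("customer_id").orderBy(F.col("change_ts").desc())))
        .filter("rn = 1")
        .drop("rn")
)

# Drop keys whose newest change is a delete, then overwrite the snapshot with the fresh view.
latest.filter(F.col("op") != "D").write.mode("overwrite").saveAsTable("lake.customers_current")
```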
31. Mastering Data Assets with Guardian
Guardian Life Insurance has provided protection solutions for 150 years, has a long history of strong, successful customer relationships, and has spent 20 years on the Fortune 500 list. Guardian uses state-of-the-art technology to drive awareness and engagement for optimal results. Flexible funding options to meet each customer's unique needs, fast and accurate claims, and long-term financial strength have led to award-winning, customer-focused service.
"We found DMX-h to be very usable and easy to ramp up in terms of skills. Most of all, Syncsort has been a very good partner in terms of support and listening to our needs." – Alex Rosenthal, Enterprise Data Office
"Syncsort's DataFunnel™ has been a powerful tool in our data lake strategy. We were able to ingest into Hadoop over 800 tables from one source system … with one press of the button."
Business Challenge:
• Include mainframe data in comprehensive data-as-a-service for internal self-service analytics.
• Ingest to HDFS hundreds of mainframe DB2 tables, hundreds of Oracle tables and 11 VSAM data sets.
• Time-to-market for analytics projects was unacceptable (6–12 months) and not repeatable.
• 100 TB of DB2/z data to monitor for changes; batch CDC couldn't keep it current fast enough.
Solution:
• DMX-h to easily load VSAM data to HDFS; connect, transfer and translate data.
• DMX DataFunnel to quickly and easily load over 800 tables from DB2 and Oracle.
• Migrated 49 COBOL and 14 JCL jobs from the mainframe to DMX-h.
• DMX CDC grabs delta changes in real time and pushes them directly to Hive.
Benefit:
• Hard-to-access mainframe data all included for comprehensive analytics.
• Simplified transformation processes and reused data assets.
• Hundreds of man-hours saved.
• 1.4 terabytes of Oracle data loaded in 3.5 hours.
• No third-party software installed on the mainframe.
Business Value:
• Shortened time-to-market for data and analytics projects.
• Centralized, standardized, reusable data assets that are searchable, accessible and managed.
• Increased ease of self-service customized report building and dashboarding.
• 50 different business applications depend on this data; it is now better managed and more current, and its analytics output is more trustworthy.
33. Benefits Businesses are Actually Getting from Big Data
• Increase Productivity
• Reduce Costs
• Next-Gen Analytics
• Increase Revenue and Growth
• Archive Data
• Increase Agility
• Get More from EDW / Mainframe Investment
• Retain Data for Compliance
• Free Mainframe Resources and Reduce Costs
34. Insurance Company Moves Historical Data to Azure Cloud
Before:
• One year of sales data was available to key business apps, stored on expensive DASD.
• 97 TB of historical data was stored on unreadable, inaccessible virtual tape.
• Key business applications had no daily access to historical data; if that data was needed for a quote, Syncsort MFX could retrieve it by running several jobs over a few weeks.
After (with MFX, DMX & Azure):
• Syncsort MFX converted the virtual tape data to mainframe variable format.
• Syncsort DMX used over 300 copybooks to translate the mainframe variable data into human-readable text and remove duplicates.
• Microsoft Azure Data Import Service put all 97 TB into Cloudera CDH in the Azure cloud.
• Key business applications moved to the cloud.
• All sales data is encrypted securely in the cloud.
• Applications have instant access to all 97 TB of historical data.
[Diagram: Before – the mainframe app, with one year of sales data on the mainframe and 18 years of sales data on virtual tape with no access, does quotes and checks sold cases and rejects. After – the cloud app gives quotes and reports sold cases and rejects in seconds, with instant access to all data.]
36. Syncsort Helps You Beat the Challenges of Big Data
• Get mainframe data into Hadoop easily, in Hadoop format or even in original mainframe format.
• Secure, govern, manage and monitor the entire process.
• Bridge the Big Iron to Big Data skills gap.
• Reduce development time from weeks to days.
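As an illustration of what "mainframe data in Hadoop format" involves, the hypothetical Python sketch below decodes one fixed-width EBCDIC record into readable fields before it would be landed in the cluster. The record layout is an invented stand-in for a real COBOL copybook, and this is not Syncsort's implementation.

```python
# Hypothetical fixed-width layout standing in for a COBOL copybook:
#   CUST-ID PIC X(6), CUST-NAME PIC X(20), BALANCE PIC 9(7)V99 (zoned-decimal signs ignored for simplicity)
LAYOUT = [("cust_id", 0, 6), ("cust_name", 6, 26), ("balance", 26, 35)]

def decode_record(raw: bytes) -> dict:
    """Decode one EBCDIC (code page 037) record into named text fields."""
    text = raw.decode("cp037")
    fields = {name: text[start:end].strip() for name, start, end in LAYOUT}
    fields["balance"] = int(fields["balance"]) / 100  # implied two decimal places
    return fields

# Example: a 35-byte EBCDIC record as it might arrive from a mainframe transfer.
sample = "A00017Jane Smith          000012345".encode("cp037")
print(decode_record(sample))
# {'cust_id': 'A00017', 'cust_name': 'Jane Smith', 'balance': 123.45}
```

Real copybooks add packed decimals, REDEFINES and OCCURS clauses, which is why tooling that reads the copybook directly saves so much hand-coding.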
37. Get Your Database data into Hadoop, At the Press of a Button
DMX DataFunnel™
• Funnel hundreds of tables at once into your data lake
‒ Extract, map and move whole DB schemas in one invocation
‒ Extract from DB2, Oracle, Teradata, Netezza, S3, Redshift …
‒ To SQL Server, Postgres, Hive, Redshift and HDFS
‒ Automatically create target tables
• Process multiple funnels in parallel on edge node or data nodes
‒ Order data flows by dependencies
‒ Leverage DMX-h high performance data processing engine
• Filter unwanted data before extraction
‒ Data type filtering
‒ Table, record or column exclusion / inclusion
• In-flight transformations and cleansing
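For readers without the product, a rough Python sketch of the general "funnel many tables in parallel" pattern that DataFunnel automates might look like the following. The connection strings, schema name, table filter and the pandas/SQLAlchemy approach are all assumptions for illustration, not DataFunnel's API.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from sqlalchemy import create_engine, inspect

# Hypothetical source and target connections.
source = create_engine("oracle+cx_oracle://user:pw@src-host:1521/?service_name=ORCL")
target = create_engine("postgresql+psycopg2://user:pw@lake-host:5432/staging")

def copy_table(table: str) -> str:
    """Extract one source table and load it into the target, creating the target table automatically."""
    df = pd.read_sql_table(table, source, schema="SALES")
    df.to_sql(table.lower(), target, schema="staging", if_exists="replace", index=False)
    return f"{table}: {len(df)} rows copied"

# Discover the whole schema, exclude unwanted tables, then copy several tables in parallel.
tables = [t for t in inspect(source).get_table_names(schema="SALES") if not t.startswith("TMP_")]
with ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(copy_table, tables):
        print(result)
```

The product handles the parts this sketch glosses over: ordering flows by dependencies, pushing work to the cluster's data nodes, and applying in-flight transformations at engine speed.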
38. Trillium Quality for Big Data
• Intelligent Execution enables deployment to Hadoop MapReduce and Spark.
• Verify and enrich global postal addresses using global postal reference sources.
• Enrich data from external, third-party sources to create comprehensive, unified records, enabling 360-degree views of the customer and other key business entities.
• Identify records that belong to the same domain (e.g., household or business).
• Parse data values to their correct fields and standardize for better matching.
• Match like records and eliminate duplicates.
Easily create data quality workflows on Hadoop without MapReduce or Spark coding.
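As a tiny, hypothetical Python sketch of the match-and-deduplicate step (not Trillium's matching engine; the similarity threshold, blocking on postcode, and field names are assumptions):

```python
from difflib import SequenceMatcher

# Hypothetical customer records that have already been parsed and standardized.
records = [
    {"id": 1, "name": "jane smith", "postcode": "sw1a 2aa"},
    {"id": 2, "name": "jane smyth", "postcode": "sw1a 2aa"},
    {"id": 3, "name": "bob jones", "postcode": "ls1 4dy"},
]

def similar(a: str, b: str) -> float:
    """Rough string similarity between two standardized values (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()

# Group records: a record joins an existing group if the postcode matches exactly
# and the name is similar enough; otherwise it starts a new group.
groups: list[list[dict]] = []
for rec in records:
    for group in groups:
        head = group[0]
        if rec["postcode"] == head["postcode"] and similar(rec["name"], head["name"]) > 0.85:
            group.append(rec)
            break
    else:
        groups.append([rec])

survivors = [group[0] for group in groups]            # keep one record per matched group
print([[r["id"] for r in g] for g in groups])         # [[1, 2], [3]]
print([r["id"] for r in survivors])                   # [1, 3]
```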
39. Syncsort Enables Governance
• Metadata and data lineage for Hive, Avro and Parquet through HCatalog.
• Metadata lineage from DMX/DMX-h and Trillium Quality for Big Data:
  – Simplify audits, analytics dashboards and metrics.
  – Run-time job metadata and lineage REST API.
  – Integrate with enterprise metadata repositories like ASG.
• Cloudera Navigator certified integration:
  – Extends HCatalog metadata.
  – HDFS, YARN, Spark and other metadata.
  – Business and structural metadata.
  – Audit and track data from source to cluster.
• Apache Atlas ingestion lineage integration:
  – Audit and track data from source to cluster.
  – Detailed field-level lineage.
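To illustrate what run-time lineage metadata can look like, here is a hedged Python sketch that assembles a simple source-to-target lineage record for one job run. The JSON shape and field names are illustrative assumptions, not Syncsort's, Navigator's or Atlas's actual schema; in practice such a payload would be posted to the metadata repository's REST endpoint.

```python
import json
from datetime import datetime, timezone

# Hypothetical lineage record describing one job run: where data came from and where it landed.
lineage = {
    "job": "ingest_customers_daily",
    "run_at": datetime.now(timezone.utc).isoformat(),
    "inputs": [{"type": "db2_table", "name": "SALES.CUSTOMERS"}],
    "outputs": [{"type": "hive_table", "name": "lake.customers_raw"}],
    "fields": [{"source": "SALES.CUSTOMERS.CUST_NM", "target": "lake.customers_raw.customer_name"}],
}

# Serialize for hand-off; a governance tool would ingest this to drive audits and impact analysis.
print(json.dumps(lineage, indent=2))
```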
40. Syncsort Real-Time Change Data Capture
Keep data in sync in real time:
• Without overloading networks.
• Without affecting source database performance.
• Without coding or tuning.
Sources: IBM DB2, IBM Informix, Oracle, Oracle RAC, Sybase, MS SQL Server
Targets: HDFS, Hive, IBM DB2, IBM Informix, Oracle, Oracle RAC, Sybase, MS SQL Server, Teradata, MySQL, PostgreSQL
Dependable – Reliable transfer of data even if connectivity fails on either side.
Fast – Captures changes in the source as they happen. Updates table statistics for faster queries.
Flexible – Writes to HDFS, all Hive tables (including those backed by text, ORC, Parquet or Avro), and most major RDBMSs. Even updates Hive versions that don't support updates.
Real-Time Replication with Transformation
Conflict Resolution, Collision Monitoring, Tracking and Auditing
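Conceptually, a change data capture pipeline reads an ordered stream of change events from the source's log and replays them on each target. The toy Python sketch below shows that replay loop for a single table; the event format and the in-memory "target" are assumptions for illustration only, standing in for a Hive table or another RDBMS.

```python
# A toy ordered stream of change events, as a log-based CDC reader might emit them.
change_events = [
    {"op": "I", "key": 101, "row": {"name": "Jane Smith", "balance": 250.00}},
    {"op": "U", "key": 101, "row": {"name": "Jane Smith", "balance": 300.00}},
    {"op": "I", "key": 102, "row": {"name": "Bob Jones", "balance": 80.00}},
    {"op": "D", "key": 102, "row": None},
]

# Stand-in for the real target (a Hive table, HDFS files, or another database).
target: dict[int, dict] = {}

def apply(event: dict) -> None:
    """Replay one insert/update/delete on the target, keeping it in sync with the source."""
    if event["op"] in ("I", "U"):
        target[event["key"]] = event["row"]
    elif event["op"] == "D":
        target.pop(event["key"], None)

for event in change_events:
    apply(event)

print(target)  # {101: {'name': 'Jane Smith', 'balance': 300.0}}
```

The hard parts a product adds on top of this loop are exactly the ones the slide lists: resilient transport, conflict resolution, and auditing of every change applied.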
41. Implementation Challenges
(Recap of the survey's top implementation challenges and chart shown on slide 10.)
42. Design Once, Deploy Anywhere
• Use existing ETL skills.
• No need to worry about mappers, reducers, big side or small side of joins, etc.
• Automatic optimization for best performance, load balancing, etc.
• No changes or tuning required, even if you change execution frameworks.
• Future-proof job designs for emerging compute frameworks, e.g., Spark 2.x.
• Run multiple execution frameworks in a single job.
Design in a single GUI, execute anywhere. Intelligent Execution insulates your organization from the underlying complexities of Big Data.
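The "design once, deploy anywhere" idea can be pictured as a job definition kept separate from the engine that runs it. Below is a hedged Python sketch of that separation; the job spec, the local executor and the Spark executor are invented for illustration and are not DMX-h internals.

```python
from dataclasses import dataclass
from typing import Callable

# A declarative job definition: what to read, how to transform, where to write.
@dataclass
class Job:
    source_path: str
    transform: Callable      # line -> line, engine-agnostic business logic
    target_path: str

def run_locally(job: Job) -> None:
    """Small-data executor: plain Python file processing."""
    with open(job.source_path) as src, open(job.target_path, "w") as dst:
        for line in src:
            dst.write(job.transform(line))

def run_on_spark(job: Job) -> None:
    """Cluster executor: the same job definition, executed by Spark."""
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    spark.sparkContext.textFile(job.source_path).map(job.transform).saveAsTextFile(job.target_path)

# Design the job once; choosing an execution framework is a deployment decision.
with open("input.txt", "w") as f:
    f.write("hello\nworld\n")

job = Job("input.txt", lambda line: line.upper(), "output.txt")
run_locally(job)        # small data: plain Python
# run_on_spark(job)     # big data: same job definition, run on a cluster (output becomes a directory)
```

Keeping the transform logic engine-agnostic is what makes the job portable: swapping executors changes where and how the work runs, not what the job means.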
43. Syncsort Makes ALL Data Accessible & Usable – Ready for Analytics
Get the ebook: 2018 Big Data Trends: Liberate, Integrate and Trust
http://www.syncsort.com/en/Resource-Center/BigData/eBooks/2018-Big-Data-Trends-Liberate-Integrate-Trust
Contact Syncsort sales to get the latest Syncsort info: http://www.syncsort.com/en/ContactSales