Data lakes should be a critical part of your data initiatives, but you don’t get the benefits for free. Like any business project, a bit of planning goes a long way. To get the most from your data lake investment, it’s important to learn from others’ successes, as well as their mistakes, and apply best practices and technologies.
4. Agenda
• Data lake: Definitions and in context with other systems
• Trends and drivers
• Impact of the cloud on data lakes
• Use cases: analytics, AI, BI
• Recommendations
• Hortonworks presentation
• Arcadia Data presentation
• Audience Q&A
5. The Data Lake: What Is It?
• Method for organizing large volumes of highly diverse data from diverse sources
– For broad data exploration and discovery, plus advanced analytics
– Depending on the platform, a data lake may handle many data structures
• Ingest first, prep later: Make data available ASAP
– Persisting data in its raw, detailed state
• For multiple use cases: advanced analytics, AI/machine learning, and BI
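The "ingest first, prep later" idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the local directory standing in for lake storage, the function names `ingest_raw` and `prep_json_lines`, and the sample payload are all hypothetical and do not come from the presentation.

```python
import json
import tempfile
from pathlib import Path


def ingest_raw(landing_dir: Path, source: str, payload: bytes) -> Path:
    """Persist the payload exactly as received; no parsing or cleanup."""
    target_dir = landing_dir / source
    target_dir.mkdir(parents=True, exist_ok=True)
    # Sequence-numbered file name so earlier arrivals are never overwritten.
    target = target_dir / f"{len(list(target_dir.iterdir())):06d}.raw"
    target.write_bytes(payload)
    return target


def prep_json_lines(raw_file: Path) -> list:
    """Prep step, run later and on demand: keep lines that parse as JSON objects."""
    records = []
    text = raw_file.read_bytes().decode("utf-8", errors="replace")
    for line in text.splitlines():
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            continue  # the line stays in the raw file for later inspection
        if isinstance(value, dict):
            records.append(value)
    return records


def demo():
    """Ingest a mixed payload, then prep it; returns (stored bytes, prepped records)."""
    with tempfile.TemporaryDirectory() as tmp:
        payload = b'{"device": "a1", "temp": 21.5}\nnot json at all\n{"device": "b2"}\n'
        raw_file = ingest_raw(Path(tmp), "iot", payload)
        return raw_file.read_bytes(), prep_json_lines(raw_file)
```

Calling `demo()` returns the stored bytes unchanged alongside the two records that could be parsed. The malformed line is not lost: it remains in the raw file, which is the point of persisting data in its raw, detailed state.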
6. What’s Driving Adoption?
• Data-driven ambitions: Organizations want to realize value from all of their data, not just well-known datasets (e.g., in the data warehouse)
– Hadoop/Spark ecosystems; the NoSQL database revolution
• Instrumentation: Organizations have more data coming online
– IoT, geolocation, multichannel customer behavior, social media, text, images
7. Data Lakes: Becoming Established
• A similar percentage of organizations surveyed indicate plans to use a data lake to either augment or replace the data warehouse
Source: TDWI
8. Impact of the Cloud on Data Lakes
• “Data gravity” pulls data into the cloud
– Interest in reducing data movement
– Accessing and analyzing data where it is created; 30% are using BI in the cloud and 38% plan to
– Overflowing the banks of on-premises data lakes: streaming data into cloud storage; 32% in TDWI research see this use case
• Cloud storage (e.g., object, file, block) used to augment – or rival – Hadoop ecosystem data lakes
9. Data Warehouses and Data Lakes Are Complementary, and They Integrate Well
RELATIONAL DATA WAREHOUSE | HADOOP-BASED DATA LAKE
RDBMS for relational requirements | Hadoop for diverse data, scalability, low cost
Data of recognized high value | Candidate data of potential value
Mostly refined calculated data | Mostly detailed source data
Known entities, tracked over time | Raw material for discovering entities & facts
Data conforms to enterprise standards | Fidelity to original format and condition
Data integration up front | Data prep on demand
Data transformed a priori | Data repurposed later, as needs arise
Typically schema on write | Typically schema on read
A priori metadata improvement | Metadata developed on read, in many cases
SOURCE: TDWI 2017 Best Practices Report on Data
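The schema-on-write versus schema-on-read contrast in the comparison above can be made concrete with a small Python sketch. The schema, field names, and sample records are hypothetical, and the two functions are a simplified illustration rather than how any particular warehouse or Hadoop engine works.

```python
import json

# Hypothetical schema used by both approaches.
SCHEMA = {"customer_id": int, "amount": float}


def write_with_schema(table: list, record: dict) -> None:
    """Schema on write: reject records that do not conform before storing."""
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"{field!r} must be {expected.__name__}")
    table.append({field: record[field] for field in SCHEMA})


def read_with_schema(raw_lines: list) -> list:
    """Schema on read: store anything, coerce to the schema only when queried."""
    rows = []
    for line in raw_lines:
        raw = json.loads(line)
        try:
            rows.append({field: cast(raw[field]) for field, cast in SCHEMA.items()})
        except (KeyError, TypeError, ValueError):
            continue  # the record stays in the lake; this reader just skips it
    return rows


# Warehouse side: data must be conformed up front.
warehouse = []
write_with_schema(warehouse, {"customer_id": 7, "amount": 19.99})

# Lake side: raw lines are kept as they arrived, extra fields and all.
lake = [
    '{"customer_id": "7", "amount": "19.99", "channel": "web"}',
    '{"customer_id": 8}',
]
lake_rows = read_with_schema(lake)
```

In this sketch the lake keeps the `channel` field and the incomplete second record, even though the current reader ignores them. A later reader with a different schema could still use both, which is what "data repurposed later, as needs arise" implies.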
10. BI and Analytics on Data Lakes: Growing
• “Replacing” ETL: An original attraction of data lakes
– 20%-30% plan data lake to augment or replace data warehouse
• SQL-on-Hadoop and Hadoop/Spark-native BI
– Engines for intensive SQL workloads on data lakes
• Fast interaction: moving from batch to micro-batch and faster data interaction
– Relevance for analytics and AI/ML
– Growing relevance for operational reporting
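As a rough illustration of the move from batch to micro-batch mentioned above, the following Python sketch processes a stream in small groups so that results are refreshed after each group instead of once at the end. Count-based batching and the event shape are assumptions made for simplicity; streaming engines commonly batch by time window instead.

```python
from itertools import islice
from typing import Iterable, Iterator


def micro_batches(events: Iterable, batch_size: int) -> Iterator[list]:
    """Yield small fixed-size batches instead of waiting for the full dataset."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_totals(events: Iterable, batch_size: int) -> list:
    """Update an aggregate after every micro-batch, as a report might."""
    total = 0
    totals = []
    for batch in micro_batches(events, batch_size):
        total += sum(event["amount"] for event in batch)
        totals.append(total)
    return totals


# Hypothetical event stream.
events = [{"amount": n} for n in (5, 10, 15, 20, 25)]
totals = running_totals(events, batch_size=2)
```

With a batch size of 2, the five events produce three intermediate totals rather than a single end-of-run figure, which is the property that matters for operational reporting.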
12. REFERENCE ARCHITECTURE w/HADOOP-BASED DATA LAKE
Modern Data Warehouse Environment (DWE)
DIVERSE SYSTEM PLATFORMS: Web, Client/Server, Clusters, Racks, Grids, Clouds, Hybrid Combinations…
• Traditional Data: CRM, SFA, ERP…; financials, billing, operations, call center, human resources…
• Core Warehouse: dimensions, cubes, subject areas, time series, metrics...; data for reports, dashboards…; set-based analytics (SQL, OLAP)
• Specialized DBMSs: based on clouds, appliances, columns, graph, sandboxes, and other specialized analytics
• New Data, Big Data: machine data, IoT, mobile data, web data, social media…
• Hadoop
– Ingestion: data landing and staging, ETL/ELT processing, stream capture, complex event processing
– Data Lake: massive store of raw source data (actual data lake), algorithmic analytics, departmental data domains (Mktg, Sales)
CROSS-SYSTEM INTEGRATION: Application Integration, Data Integration, Interfaces, Metadata; Queries, Virtualization…
Many ingestion methods; many delivery methods
SOURCE: TDWI 2017 Best Practices Report on Data
13. Users’ Concerns RE: Data Lakes
Addressing these ensures success
• Lack of data governance (DG) (41%)
• Inadequate skills for big data (32%), Hadoop (32%), data integration (32%)
• Lack of compelling business case (31%) or sponsorship (28%)
• Exposing sensitive data (28%)
• Immaturity of data lake concept (27%)
• OTHER: Self-service tools for end users, business metadata, automation for DG
SOURCE: TDWI 2017 Best Practices Report on Data
14. MYTH: If we build it, they will come.
• They won’t come unless you have:
– Compelling business case, sponsorship, and funding
– Right tools for self service BI, visualization, analytics, AI/ML
• They won’t stay without:
– Ability to find governed, trusted data
– Plan for controlled expansion
• They won’t succeed without:
– Lake & tool training; consulting help
15. Data Catalog: Important to Gaining Value from Lakes & Data Architecture
• Objective: Sharing knowledge about distributed data; improving quality, efficiency, curation, and governance
• Not all catalogs are alike
– Some cover only single BI, DW, data marts
– Others specialize in legacy systems
• Growing for easier data discovery in data lake
Diagram: Central data catalog (data definitions, attributes, and other metadata), used by users, applications, and services (search, browse, query, filter), and drawing on:
– Databases, data warehouses, data marts, and data lakes
– ETL, source-to-target mappings
– Applications, logs, operational systems
– BI/OLAP systems (with their own metadata)
– Mainframe, files, other legacy systems
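To make the central catalog idea above more tangible, here is a minimal in-memory sketch in Python. The class names, fields, and sample entries are hypothetical; real catalog products add lineage, access control, and automated metadata harvesting that this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    system: str  # e.g. "data lake", "data warehouse", "BI/OLAP"
    description: str
    attributes: list = field(default_factory=list)


class DataCatalog:
    """One searchable place for metadata about datasets in many systems."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[(entry.system, entry.name)] = entry

    def search(self, term: str) -> list:
        """Case-insensitive match on name, description, or attribute names."""
        term = term.lower()
        hits = [
            entry for entry in self._entries.values()
            if term in entry.name.lower()
            or term in entry.description.lower()
            or any(term in attr.lower() for attr in entry.attributes)
        ]
        return sorted(hits, key=lambda entry: (entry.system, entry.name))


catalog = DataCatalog()
catalog.register(CatalogEntry(
    "clickstream_raw", "data lake", "Raw web click events", ["session_id", "url"]))
catalog.register(CatalogEntry(
    "dim_customer", "data warehouse", "Conformed customer dimension",
    ["customer_id", "segment"]))
catalog.register(CatalogEntry(
    "sales_cube", "BI/OLAP", "Sales by customer segment", ["segment", "revenue"]))

segment_hits = [(entry.system, entry.name) for entry in catalog.search("segment")]
```

A search for "segment" finds the warehouse dimension and the BI cube in a single query, even though they live in different systems. That cross-system discovery is the main benefit the slide attributes to a shared catalog.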
16. Empowerment Expectations for Business Analytics via a Data Lake
• Users expect self-service access to lake data
– Without it, the lake is a failure to them
• Integrated sequence of self-service best practices.
– Data access, exploration, prep, visualization, and analysis
– This requires friendly business metadata or equivalent semantics
• Advanced analytics that complement older analytics.
– Firms want more than just OLAP & SQL-based analytics.
– Hadoop-based, algorithmic analytic processing: mining, stats, graph…
• Analytic value from human language, text, other unstructured data.
– Hadoop-based lake is natural fit for this untapped data.
17. Recommendations
• Focus on business use cases that a lake can address.
• Identify the desired ROI
• Prioritize use cases – as there will be many proposed!
• Evaluate demand for BI and business analytics access and interaction with data
• Evaluate potential of shared data catalog for improving data discovery and management of data lake
• Think about the big picture: the enterprise data architecture (involving cloud and on premises)
19. Poll Question
• What do you see as the #1 role (or potential role) of a data lake in your organization?
– To support advanced analytics and AI/machine learning
– For offloading data warehouse processes (e.g., ETL)
– To replace the data warehouse
– Place to collect operational and streaming data
– To support self-service BI, data discovery, and visualization
– Not sure, don’t know
34. Arcadia Data Mission: To Connect Business Users to Big Data
Arcadia Data. Proprietary and Confidential
• Founding team from Teradata Aster, HPE 3PAR, IBM DB2
• Architected to solve challenges around big data analytics
• Growing customer base in the Fortune 500
Recent Awards
• Strong Performer: Hadoop Native BI Wave Report. ”Put your BI where your data is”
• Gartner names Arcadia Data a Cool Vendor for IoT Analytics.
• Winner, Datanami Editors Choice Award for Best Big Data Product and Technology: Data Visualization
Investors and Customers (logos)
35. Data Drives Market Disruption
Campaign Analysis Application
• Understand high-level metrics with the ability to drill down to details
• Augment analysis with a variety of data types & sources such as actual display ad images
36. Data Drives Market Disruption
Retail Store Geographic Analysis
• YoY growth metrics plotted by county for the chosen sub-brand
• Trellising allows for quick trend analysis across multiple stores. Here it shows store sales vs. trade area sales to correlate potential shifts in buying patterns
• Choose a specific state to drill down to county level
38. Getting Insight From Big Data Is Hard
• Platforms that very few people can use
• Business users that email Excel sheets around
• Traditional BI on Data Lakes
39. A Framework for Data Lake BI
• Native Architecture
• Smart Acceleration
• Connected Ecosystem
• AI Scoring and Recommendations
42. Instant Visuals – AI-Based Visualization Recommendations
Select data fields, then one click…
Visualization Builder Recommended Visualizations
shows which visuals best represent your data.
43. Smart Acceleration Leverages What Is Learned during Data Discovery
Query acceleration for scale, performance, and concurrency
• Fast query responses
• Minimal modeling
• Live acceleration (no downtime)
Arcadia Enterprise makes recommendations – build these with a click.
(Diagram labels: Ad hoc queries; Accelerated application queries; Data Lake Cluster; All Granular Data; Analytical Views)
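One way to picture acceleration that learns from data discovery is to count which groupings ad hoc queries ask for and pre-compute the popular ones. The Python sketch below is a generic illustration of that idea, not Arcadia Data's implementation; the query log format, the `min_count` threshold, and the function names are all assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical granular data in the lake.
rows = [
    {"state": "CA", "brand": "A", "sales": 100},
    {"state": "CA", "brand": "B", "sales": 50},
    {"state": "NY", "brand": "A", "sales": 70},
]

# Hypothetical log of ad hoc queries, reduced to their GROUP BY columns.
query_log = [("state",), ("state",), ("state", "brand"), ("state",)]


def recommend_views(log, min_count=2):
    """Suggest groupings that were queried at least min_count times."""
    return [cols for cols, count in Counter(log).most_common() if count >= min_count]


def build_view(data, group_cols, measure):
    """Pre-compute the aggregate once so later queries avoid a full scan."""
    view = defaultdict(int)
    for row in data:
        view[tuple(row[col] for col in group_cols)] += row[measure]
    return dict(view)


recommended = recommend_views(query_log)
views = {cols: build_view(rows, cols, "sales") for cols in recommended}
```

Here only the `state` grouping is requested often enough to be recommended, so a dashboard asking for sales by state can read two pre-computed values instead of scanning every granular row.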
44. Sample Results of Benchmark Tests
Significant Acceleration for Dashboards
Improvement of 21x to 88x
Efficient Concurrency Scaling
Responsiveness at 95 Concurrent Users
https://www.arcadiadata.com/lp/esg-technical-review-native-bi-performance-for-data-lakes/
45. Three Ways Legacy BI Tools Cannot Scale on Data Lakes
(1) ODBC to Big Data SQL Engine
– PRO: Can query all the data in Hadoop
– CON: Can’t scale on user concurrency
(2) Extract Data to BI Server
– PRO: Scales to more users
– CON: Can’t query all the data
(3) Bolt-On an External BI Accelerator
– PRO: More users on more data
– CON: Complicated architecture, multiple security models, more software products, edge nodes, data movement, etc.
(Diagram labels: ODBC; query; extract; results; BI Server; BI Accelerator; Hadoop nodes; Edge nodes)
46. How Arcadia Data Scales for Data Lakes
• ArcEngine. The powerful query acceleration engine runs directly on the compute nodes of the cluster.
• Analytical Views. Queries are accelerated via pre-computed AVs. These are stored right next to the raw data in HDFS or cloud object stores.
• ArcViz. The lightweight visualization server handles the UI, and includes a data-coherent cache to boost performance/concurrency while always honoring data and permission changes to the underlying store.
Hadoop Cluster (Hive, Spark, YARN, HDFS, etc.)
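The notion of a cache that stays coherent with the underlying store can be sketched generically: tag each cached result with the store version it was computed from, and recompute when the version has moved. The Python below is an illustrative assumption, not a description of how ArcViz works; `VersionedStore` and `CoherentCache` are hypothetical names, and permission changes, which the slide also mentions, are not modeled.

```python
class VersionedStore:
    """Stand-in for the underlying store; bumps a version on every change."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.version = 0
        self.scans = 0  # counts how often the store is actually scanned

    def append(self, row):
        self._rows.append(row)
        self.version += 1

    def total(self, column):
        self.scans += 1
        return sum(row[column] for row in self._rows)


class CoherentCache:
    """Serve cached results only while the store version is unchanged."""

    def __init__(self, store):
        self._store = store
        self._cache = {}

    def total(self, column):
        cached = self._cache.get(column)
        if cached is not None and cached[0] == self._store.version:
            return cached[1]  # still valid: no scan needed
        value = self._store.total(column)
        self._cache[column] = (self._store.version, value)
        return value


store = VersionedStore([{"sales": 10}, {"sales": 20}])
cache = CoherentCache(store)
first = cache.total("sales")   # scans the store
second = cache.total("sales")  # served from cache
store.append({"sales": 5})     # data change invalidates the cached result
third = cache.total("sales")   # scans again and reflects the new row
```

Repeated requests between changes cost one scan in total, which helps concurrency, while any change to the data forces a fresh computation so users never see a stale figure.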
48. Data Warehouse BI Architecture
BI Server
Analytic Process
Optimize Physical
Semantic Layer
Secure Data
Load Data
Big Data Requirements
Native Connection
Semi-Structured
Parallel
Real-time
Data Warehouse
(RDBMS)
49. Data Lake BI Architecture – The Arcadia Data Way
BI Server
Analytic Process
Optimize Physical
Semantic Layer
Secure Data
Load Data
Big Data Requirements
Native Connection
Semi-Structured
Parallel
Real-time
Data Warehouse
(RDBMS)
Data Lake
(HDFS, Cloud Object Storage)
Arcadia Data was built from inception to run natively within data lakes
50. Social media: @arcadiadata; arcadiadata.com
• How Neustar MarketShare Scales Their Data Lake: https://www.arcadiadata.com/lp/how-neustar-scales-saas-based-reporting-and-analytics-for-1000s-of-customer-users/
• Search-Based BI in Action: http://watch.arcadiadata.com/watch/NSf1mMENjYfTY2cjpuGWPS
• Download Arcadia Instant: arcadiadata.com/instant
• www.arcadiadata.com/resources
52. CONTACT INFORMATION
If you have further questions or comments:
David Stodder, TDWI
dstodder@tdwi.org
Dale Kim, Arcadia Data
dale@arcadiadata.com
Ali Bajwa, Hortonworks
abajwa@hortonworks.com
tdwi.org
53. Learn More in Las Vegas!
TDWI Conference
Keynotes, Educational Classes, Networking, and More
Las Vegas, NV, February 10-15, 2019
http://www.tdwi.org/lasvegas
TDWI Strategy Summit
Case Studies, Expert Talks, and Leadership Panels
Las Vegas, NV, February 11-12, 2019
https://tdwi.org/events/strategy-summits/las-vegas