Watch full webinar here: https://buff.ly/309CZ1Y
Advanced data science techniques, like machine learning, have proven to be extremely useful tools for deriving valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala, put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative that addresses these issues in a more efficient and agile way.
Attend this webinar and learn:
*How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
*How popular tools from the data science ecosystem (Spark, Python, Zeppelin, Jupyter, etc.) integrate with Denodo
*How you can use the Denodo Platform with large data volumes in an efficient way
*About the success McCormick has had as a result of seasoning the machine learning and blockchain landscape with data virtualization
Data Virtualization Simplifies Data Science Workflows
1. DATA VIRTUALIZATION PACKED LUNCH
WEBINAR SERIES
Sessions Covering Key Data Integration Challenges
Solved with Data Virtualization
2. Minimizing the Complexities of Machine Learning
with Data Virtualization
Pablo Alvarez-Yanez
Director of Product Management, Denodo
3. Chikio Hayashi, 1998: "What is Data Science? Fundamental Concepts and a Heuristic Example"
"Data science is a concept to unify statistics, data analysis, machine learning and their related methods in order to understand and analyze actual phenomena with data."
4. Data Science – Brief History
Data Science is an umbrella term that has recently received a lot of media attention.
However, making sense of data in some way has been the job of scientists, statisticians, computer scientists and business analysts for years.
The term data science was first used in 1996, at a conference of the International Federation of Classification Societies (IFCS) in Japan.
For a good review of the history of the term, see the Forbes article "A Very Short History of Data Science":
• https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#53641eb955cf
5. The Tools of Data Science
When thinking about data science, most minds immediately go to languages like Python and R, or tools like Spark and TensorFlow.
There is a myriad of projects that currently serve the needs of the data scientist.
6. The Data Scientist Workflow
A typical workflow for a data scientist is:
1. Gather the requirements for the business problem
2. Identify useful data
▪ Ingest data
3. Cleanse data into a useful format
4. Analyze data
5. Prepare input for your algorithms
6. Execute data science algorithms (ML, AI, etc.)
▪ Iterate steps 2 to 6 until valuable insights are produced
7. Visualize and share
Source: http://sudeep.co/data-science/Understanding-the-Data-Science-Lifecycle/
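As an illustration only (not from the webinar), steps 2 to 6 map onto a short script. This is a minimal sketch assuming a pandas/scikit-learn stack; the sales.csv file and its columns are hypothetical.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # 2. Identify and ingest data (hypothetical file and columns)
    df = pd.read_csv("sales.csv")

    # 3. Cleanse into a useful format: drop incomplete rows, fix a column name
    df = df.dropna().rename(columns={"REGION": "region"})

    # 4./5. Analyze and prepare input for the algorithm
    X = pd.get_dummies(df[["region", "units"]])
    y = df["churned"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # 6. Execute the ML algorithm and inspect the result
    model = LogisticRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))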
7. Where does your time go?
A large amount of time and effort goes into tasks not intrinsically related to data science:
• Finding where the right data may be
• Getting access to the data
• Bureaucracy
• Understanding access methods and technologies (NoSQL, REST APIs, etc.)
• Transforming data into a format that is easy to work with
• Combining data originally available in different sources and formats
• Profiling and cleansing data to eliminate incomplete or inconsistent data points
10. Data Scientist Flow
[Flow diagram: Identify useful data → Modify data into a useful format → Analyze data → Prepare for ML algorithm → Execute data science algorithms (ML, AI, etc.)]
11. Identify useful data
If the company has a virtual layer with good coverage of its data sources, this task is greatly simplified:
• A data virtualization tool like Denodo can offer unified access to all data available in the company
• It abstracts the technologies underneath, offering a standard SQL interface to query and manipulate the data
To further simplify the challenge, Denodo offers a Data Catalog to search, find and explore your data assets.
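Because Denodo exposes standard interfaces such as ODBC and JDBC, the virtual layer can be queried from a plain Python script. This is a minimal sketch, assuming an ODBC DSN named DenodoODBC has been configured; the view name bv_customer_sales and the credentials are placeholders.

    import pyodbc

    # Connect through a pre-configured ODBC DSN pointing at the Denodo server
    conn = pyodbc.connect("DSN=DenodoODBC;UID=user;PWD=secret")
    cursor = conn.cursor()

    # One standard SQL query, regardless of where the underlying data lives
    cursor.execute("SELECT customer_id, total_spend FROM bv_customer_sales")
    for row in cursor.fetchmany(10):
        print(row)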
12. Search & Explore: Metadata
Search the catalog and refine your results using descriptions, tags and business categories.
13. Search & Explore: Content
Integration with Lucene and Elasticsearch for indexing and performing keyword-based searches on the content.
14. Document your models
Rich HTML descriptions, editable directly from the catalog.
Extended metadata support to enrich the catalog with custom fields and details.
15. Data Scientist Flow
[Flow diagram, repeated from slide 10: Identify useful data → Modify data into a useful format → Analyze data → Prepare for ML algorithm → Execute data science algorithms (ML, AI, etc.)]
16. Ingestion and Data Manipulation tasks
• Typically, scientists get data from a variety of places, in various formats and through various protocols: from relational databases to REST web services and NoSQL engines.
• Data is often exported into CSV files or loaded into Spark.
• Later, that data is manipulated in scripts (e.g. Pandas and Python).
• However, data virtualization offers the unique opportunity of using standard SQL (joins, aggregations, transformations, etc.) to access, manipulate and analyze any data.
• Cleansing and transformation steps can be easily accomplished in SQL, as in the sketch below.
• Its modeling capabilities enable the definition of views that embed this logic to foster reusability.
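As a sketch of what this looks like in practice (the view names, columns and DSN are hypothetical), a cleansing join can be pushed to the virtual layer in standard SQL and the result pulled straight into a DataFrame:

    import pandas as pd
    import pyodbc

    conn = pyodbc.connect("DSN=DenodoODBC;UID=user;PWD=secret")

    # Join, filter and aggregate across sources in one standard SQL statement
    query = """
        SELECT c.region,
               AVG(o.amount) AS avg_order
        FROM   customers c
        JOIN   orders o ON o.customer_id = c.id
        WHERE  o.amount IS NOT NULL
        GROUP  BY c.region
    """
    df = pd.read_sql(query, conn)
    print(df.head())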
25. Denodo and Spark: data science with large volumes
✓ Spark as a source
▪ Spark, as well as many other Hadoop systems (Hive, Presto, Impala, HBase, etc.), can be used by Denodo as a data source to read data
✓ Spark as the processing engine
▪ In cases where Denodo needs to post-process data, for example in multi-source queries, Denodo can automatically lift and shift execution to Spark's engine
✓ Spark as the data target
▪ Denodo can automatically save the data from any execution to a target Spark cluster when your processing needs (e.g. SparkML) require local data
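As one illustration of combining the two engines (not an official recipe from the webinar), a Denodo view can also be read into Spark over plain JDBC. The driver class and URL format below are assumptions to check against your Denodo distribution, and the Denodo JDBC driver jar must be on Spark's classpath; host, database, view and credentials are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("denodo-read").getOrCreate()

    # Read a Denodo view into a Spark DataFrame over JDBC
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:vdb://denodo-host:9999/my_virtual_db")
          .option("driver", "com.denodo.vdp.jdbc.Driver")
          .option("dbtable", "bv_customer_sales")
          .option("user", "user")
          .option("password", "secret")
          .load())
    df.show()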
26. Access to Big Data Sources
Single access to all data assets, internal and external:
▪ Physical Data Lake, usually based on SQL-on-Hadoop
systems:
▪ SparkSQL (on-prem, Databricks)
▪ Presto
▪ Impala
▪ Hive
▪ Other relational databases (EDW, ODS, applications, etc.)
▪ NoSQL (MongoDB, HBase, etc.)
▪ Indexes (Elasticsearch)
▪ Files (local, S3, Azure, etc.)
▪ SaaS APIs (Salesforce, Google, social media, etc.)
27. Using Spark's Processing Engine
The Denodo optimizer provides native integration with MPP systems to provide one extra key capability: query acceleration.
Denodo can move processing to the MPP system, on demand, during the execution of a query:
• Parallel power for calculations in the virtual layer
• Avoids slow on-disk processing when processing buffers don't fit into Denodo's memory (swapped data)
28. Ingesting and Caching
Denodo's integration with SQL-on-Hadoop systems is bi-directional: remote tables and caching enable Denodo to create tables and load them with data.
This makes it possible to quickly load any data accessible by Denodo into the Hadoop cluster.
• It's significantly faster than tools like Sqoop.
This approach becomes an alternative to traditional ingestion and ELT processes.
• Unlike those processes, it preserves lineage and governance.
The load process is based on direct loads to HDFS/S3/ADLS:
1. Creation of the target table in the cache system
2. Generation of Parquet files (in chunks) with Snappy compression on the local machine
3. Parallel upload of the Parquet files to HDFS
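To make step 2 concrete, the sketch below writes one Snappy-compressed Parquet chunk with pyarrow. This only illustrates the file format involved; it is not Denodo's internal loader, and the data and file name are made up.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.5, 7.25]})
    table = pa.Table.from_pandas(df)

    # One Parquet chunk with Snappy compression, the same format Denodo
    # generates locally before the parallel upload to HDFS/S3/ADLS
    pq.write_table(table, "chunk_0.parquet", compression="snappy")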
30. Key Takeaways
✓ Denodo can play a key role in the data science ecosystem, reducing data exploration and analysis timeframes
✓ Extends and integrates with the capabilities of notebooks, Python, R, etc. to improve the toolset of the data scientist
✓ Provides a modern "SQL-on-Anything" engine
✓ Can leverage Big Data technologies like Spark (as a data source, an ingestion tool and an external processing engine) to work efficiently with large data volumes
✓ Helps productionize data science