This document provides an overview of a presentation on being a polyglot data scientist using multiple languages and tools. It discusses using SQL, R, and Python together in data science work. The presentation covers the challenges of being a polyglot, how SQL Server with R or Python can help solve problems more easily, and examples of analyzing sensor data with these tools. It also discusses resources for learning more about R, Python, and machine learning services in SQL Server.
2. Audience Survey
• How many here have used:
– SQL?
– Python?
– R?
• What job titles do people have?
3. What We Won’t Cover
• Theories behind data science and machine learning
• Deep dive into Python
• Deep dive into R
• Deep dive into SQL Server
4. Azure Support
There is a data science VM available on Azure. It won’t be covered in this presentation.
See https://docs.microsoft.com/en-us/sql/advanced-analytics/getting-started-with-machine-learning-services for details.
5. What We Will Cover
• The Problem with Being a Polyglot
• What SQL Server + R or SQL Server + Python Solves
• A Glance at these in Action
6. Not a Microsoft sales person…
• Microsoft MVP in Visual Studio
• Been into exploring data most of my life
• Been in tech over 20 years
• Practitioner and hobbyist, not researcher
7. Sample Problem: Sensor Data
• Domain: House of Sadukie
• Problem: Temperature data is stored miserably
• Goal: Display data in a visualization that makes sense
11. Data Scientist
A person employed to analyze and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business in its decision-making
12. Multi-Faceted Data Science
• Various categories:
– Statistics – modeling, sampling, clustering, reduction
– Mathematics – NSA, astronomers, military
– Data engineering – database/memory/file optimization, Hadoop, data flows
– Machine learning and algorithms
– Business – ROI optimization, decision sciences
– Software engineering – primarily polyglots in production code
– Visualization
– Spatial
Source: https://www.datasciencecentral.com/profiles/blogs/six-categories-of-data-scientists
13. The Problem with Being a Polyglot
• Understanding strengths and weaknesses of the languages
• Knowing which language is appropriate for what situation
15. What R and Python Have to Offer for SQL
• Libraries specialized to handle data science domain problems including:
– Visualization
– Data exploration
– Statistical and Mathematical Analysis
– Trending
– Regression
• Libraries + Data right from the source = quicker exploratory analysis
• Python and R are great for working from one large table and branching off in different directions
– Which can inspire additional analyses
16. Sample Problem: Sensor Data
• Number of rows: 400k+
• 1 Table
• Questions to look into:
– What are temperature trends over time?
– When are sensors going offline?
– What temperatures look spot on?
– What sensors are wavering in reads and showing inconsistencies?
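Two of these questions can be sketched in plain Python. The sketch below is only an illustration: the sample readings, the 30-minute gap threshold, and the 2-degree spread threshold are assumptions, not values from the presentation, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import pstdev

# Hypothetical readings: (sensor_id, timestamp, temperature). Not real data.
readings = [
    ("attic", datetime(2018, 1, 1, 0, 0), 61.0),
    ("attic", datetime(2018, 1, 1, 0, 10), 61.2),
    ("attic", datetime(2018, 1, 1, 1, 30), 60.8),  # 80 minutes after the previous read
    ("office", datetime(2018, 1, 1, 0, 0), 70.0),
    ("office", datetime(2018, 1, 1, 0, 10), 78.5),
    ("office", datetime(2018, 1, 1, 0, 20), 66.0),
]


def group_by_sensor(rows):
    """Return {sensor_id: [(timestamp, temp), ...]} ordered by sensor and time."""
    grouped = {}
    for sensor_id, ts, temp in sorted(rows, key=lambda r: (r[0], r[1])):
        grouped.setdefault(sensor_id, []).append((ts, temp))
    return grouped


def offline_gaps(rows, max_gap=timedelta(minutes=30)):
    """List (sensor_id, last_seen, next_seen) where consecutive reads are too far apart."""
    gaps = []
    for sensor_id, series in group_by_sensor(rows).items():
        for (prev_ts, _), (next_ts, _) in zip(series, series[1:]):
            if next_ts - prev_ts > max_gap:
                gaps.append((sensor_id, prev_ts, next_ts))
    return gaps


def wavering_sensors(rows, max_stdev=2.0):
    """List sensors whose temperature spread exceeds the assumed threshold."""
    flagged = []
    for sensor_id, series in group_by_sensor(rows).items():
        temps = [temp for _, temp in series]
        if len(temps) > 1 and pstdev(temps) > max_stdev:
            flagged.append(sensor_id)
    return flagged


gaps = offline_gaps(readings)
wavering = wavering_sensors(readings)
print(gaps)
print(wavering)
```

In practice the thresholds would come from knowing how often each sensor is supposed to report and how stable the room temperature normally is.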
18. Advanced Analytics in SQL Server 2016/2017
• SQL Server 2016
– SQL Server R Services / Machine Learning Services
• SQL Server 2017
– SQL Server R Services / Machine Learning Services
– Python Support
19. Sample Problem: Sensor Data
• Possible Strategy:
– Use SQL to gather the data into a dataset that has as much data as possible to observe.
– Use Python or R to manipulate the data results and allow for easy analysis and substantial predictions based on observations.
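The two-step strategy can be sketched as follows. This is a minimal illustration under stated assumptions: Python's built-in sqlite3 module stands in for SQL Server so the example is self-contained, the column names SensorId, ReadAt, and TempF are hypothetical, and the rows are made up.

```python
import sqlite3
from statistics import mean

# Assumption: an in-memory SQLite database stands in for SQL Server here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sensors (SensorId TEXT, ReadAt TEXT, TempF REAL)")
conn.executemany(
    "INSERT INTO Sensors VALUES (?, ?, ?)",
    [
        ("attic", "2018-01-01 00:00", 61.0),
        ("attic", "2018-01-01 12:00", 63.0),
        ("attic", "2018-01-02 00:00", 58.0),
        ("office", "2018-01-01 00:00", 70.0),
        ("office", "2018-01-02 00:00", 72.0),
    ],
)

# Step 1 (SQL): gather every reading into one dataset, with the day extracted.
rows = conn.execute(
    "SELECT SensorId, substr(ReadAt, 1, 10) AS Day, TempF "
    "FROM Sensors ORDER BY SensorId, ReadAt"
).fetchall()

# Step 2 (Python): reshape the result into a daily mean per sensor.
daily = {}
for sensor_id, day, temp in rows:
    daily.setdefault((sensor_id, day), []).append(temp)
daily_mean = {key: mean(temps) for key, temps in daily.items()}

for (sensor_id, day), value in daily_mean.items():
    print(sensor_id, day, value)
```

The split keeps the set-based gathering in SQL and leaves the reshaping to the language with the analysis libraries.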
20. Not Just Windows!
R Server for Windows
R Server for Linux
- CentOS
- RHEL
- Ubuntu
- SUSE
R Server for Hadoop – cluster in the cloud
R Server for Teradata – not as Machine Learning Server
21. SQL Server as our Base
R and/or Python on Top
Additional pieces provided by Microsoft Machine Learning Services: MicrosoftML, RevoScaleR, revoscalepy
23. Machine Learning Services in SQL Server
• Allows integration of other languages in SQL Server
– SQL Server 2016 can work with R
– SQL Server 2017 introduces Python support
• Scalable in that you can develop and test on a single machine and then deploy to distributed or parallel processing platforms.
Platforms include:
– SQL Server on Windows
– Hadoop
– Spark
24. SQL Server Machine Learning Services (In-Database)
• SQL Server R Services (In-Database) started in SQL Server 2016
• With SQL Server 2017, SQL Server Machine Learning Services (In-Database) allows us to use R and Python within SQL Server
• Do not need to open IDE and SQL tools to accomplish the work – no context switching needed!
• Can call libraries from Python or R to process data right within SQL
25. Python vs R?
• SQL Server 2016? R
• SQL Server 2017? R and/or Python
• What are you familiar with?
• Look at tutorials – what makes sense?
• What features do you need and how are they supported by Microsoft ML?
26. Python Support
• CPython 3.5
• revoscalepy – Python equivalents of RevoScaleR
• Remote compute contexts
• Also supports familiar libraries such as:
– scikit-learn
– TensorFlow
– Caffe
– Theano/Keras
27. R Code in SQL
-- R script: copy the input data frame to the output name and print a summary
DECLARE @rscript NVARCHAR(MAX);
SET @rscript = N'
SensorData <- SqlData
print(summary(SensorData))';
-- Query whose result set is passed to R as the data frame SqlData
DECLARE @sqlscript NVARCHAR(MAX);
SET @sqlscript = N'
SELECT * FROM Sensors;';
EXEC sp_execute_external_script
@language = N'R',
@script = @rscript,
@input_data_1 = @sqlscript,
@input_data_1_name = N'SqlData',
@output_data_1_name = N'SensorData';
28. Python Code in SQL
-- InputDataSet is the default name of the data frame built from @input_data_1
execute sp_execute_external_script
@language = N'Python',
@script = N'
summary = InputDataSet.describe()
print(summary.transpose())
',
@input_data_1 = N'SELECT * FROM Sensors';
GO
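To get a feel for what a describe-style summary reports for one numeric column, here is a rough plain-Python approximation. It is a sketch only: the describe function below is a hypothetical helper built on the standard statistics module, not the pandas implementation, and the temperature values are made up.

```python
from statistics import mean, quantiles, stdev


def describe(values):
    """Approximate a describe-style summary for one numeric column."""
    # method="inclusive" interpolates linearly between data points.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": mean(values),
        "std": stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }


temps = [60.0, 62.0, 64.0, 66.0, 68.0]  # hypothetical temperature readings
summary = describe(temps)
for name, value in summary.items():
    print(name, value)
```

Inside SQL Server the real summary comes from the data frame itself; this version only shows which statistics to expect in the printed output.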
30. What is RevoScaleR?
• A library written in R that includes functions for importing, transforming, and analyzing data
• Scalable, portable, and easily distributable
• Things it can do include:
– Descriptive statistics
– Generalized linear models
– Logistic Regression
– Classification trees
– Decision forest
• Multithreaded and multinode
31. Running RevoScaleR
• Part of the Machine Learning Server and Microsoft R products
• Can use any R IDE to write scripts that use RevoScaleR
• Needs to be run on a computer with the interpreter and libraries
• Two modalities:
– Locally
– Remote compute context: shift execution to the server (Windows server, Hadoop, or Spark)
32. Prediction
• Linear models
• Logistic regression models
• Generalized linear models
• Covariance and correlation
• Decision forest
• K-means clustering
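As a small illustration of the first item, a one-variable linear model can be fit by ordinary least squares in plain Python. This is a sketch of the general technique, not of how RevoScaleR or revoscalepy fit models; fit_line is a hypothetical helper and the hourly temperatures are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: hours since midnight and the temperature at each hour.
hours = [0, 1, 2, 3, 4]
temps = [60.0, 61.0, 62.0, 63.0, 64.0]

slope, intercept = fit_line(hours, temps)
predicted = slope * 5 + intercept  # forecast for hour 5
print(slope, intercept, predicted)
```

The library versions add what this sketch leaves out, such as multiple predictors, model diagnostics, and scaling across threads or nodes.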
39. Two Use Cases for Remote Compute Context
• Running R in T-SQL scripts or stored procedures
• Calling RevoScaleR in R from a SQL context
40. Visual Studio 2017: One IDE with Common Tools
• Python Tools for Visual Studio
• R Tools for Visual Studio
• SQL Server capabilities within Visual Studio
42. Polyglot Data Scientist Presentation Resources
• R Services in SQL Server 2016 (Channel 9)
• Built-in machine learning in Microsoft SQL Server 2017 with Python (Build 2017)
• MicrosoftML 1.3.0: What’s new for machine learning in Microsoft R Server (Channel 9)
• Using Visual Studio for Machine Learning (Build 2017)
• Performance patterns for machine learning services in SQL Server (Microsoft Ignite 2017)
44. Resources
• Kaggle: The Home of Data Science and Machine Learning
• DataCamp: Learn R, Python, and Data Science Online
• Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics – Vincent Granville
• Python Tutorial from Mode Analytics
• Coursera
– Mastering Software Development in R Specialization
– Data Science Specialization
– Applied Data Science with Python Specialization
– Executive Data Science Specialization
45. Contact Me
• Twitter: @sadukie
• Blog: http://codinggeekette.com
• Email: sarah@cletechconsulting.com
Sarah Dutkiewicz
Cleveland Tech Consulting, LLC
Owner