My thoughts on what data science is, which skills data scientists have, the current issues in the business intelligence (BI) pipeline, how machine learning can automate part of the BI chain, why and how data science should be democratized and made available to everyone, including decision makers (business users), how business analysts should be able to build complex data models, how data scientists should be freed from the mundane rinse-and-repeat tasks that precede building models that inform decisions, and how companies can build a business practice around data science. Big data is all data, and big-data apps offer the ability to combine all data (public + private) and expand the horizon to discover more meaningful insights.
2. Data Science is…
• An art of mining large quantities of data
• An art of combining disparate data sources and blending
public data with corporate data
• Forming hypotheses to solve hard problems
• Building models to solve current problems and provide
forecasts
• Anticipating future events (based on historical data) and
providing corrective actions (yield curve in finance, fraud
detection in banking, the effect of storms on travel,
operational downtime)
• Automating the analytics processes to reduce the time
needed to solve future problems
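The blending of public data with corporate data mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the region names, sales figures and population figures are made up, and `blend` is a hypothetical helper, not part of any product or library.

```python
# Hypothetical corporate data: sales per region (made-up numbers).
corporate_sales = {"north": 120_000.0, "south": 80_000.0, "west": 50_000.0}

# Hypothetical public data: population per region (made-up numbers).
public_population = {"north": 60_000, "south": 20_000, "east": 10_000}


def blend(sales, population):
    """Join both sources on region and derive sales per capita.

    Only regions present in both sources are kept (an inner join).
    """
    blended = {}
    for region in sorted(sales.keys() & population.keys()):
        blended[region] = {
            "sales": sales[region],
            "population": population[region],
            "sales_per_capita": sales[region] / population[region],
        }
    return blended


blended = blend(corporate_sales, public_population)
for region, row in blended.items():
    print(region, row["sales_per_capita"])
```

The inner join is a deliberate simplification: regions that appear in only one source ("west" and "east" here) are dropped rather than filled in, which a real blend would have to decide on case by case.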
3. A data scientist has the following minimum
set of core skills…
• Problem solver
• Creative and can form a hypothesis
• Is able to program with large quantities of data
• Can identify the appropriate data sources and
bring in and blend their data
• Stats/math/analytics background to build models and
write algorithms
• Can quickly develop domain knowledge to understand
the key factors that influence the outcome of a
business problem
4. Roles data scientists play…
• Problem description
• Hypothesis formation
• Data assembly, ETL and data integration
• Model development (pattern recognition or any other
model to provide answers) and training
• Data visualization
• A/B testing
• Propose solutions and/or new business ideas
5. The balance between humans and machines…
• Today: humans play a significant role in the
process (ETL, joins, models, visualization, machine
learning) and then repeat and recycle this process
as the problem changes
• Tomorrow: a big portion of this chain can be
automated via machine learning, so machines can take
over and data scientists can be freed up to build more
algorithms/models
• Once the process is automated, repeating/recycling
becomes cheaper and less time-consuming
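One way to picture why an automated process is cheaper to repeat: if the steps are captured once as code, rerunning them on a new problem costs a function call rather than manual work. The sketch below is an assumption-laden illustration, not a description of any real tool. The step names (`extract`, `enrich`, `summarize`), the `amount` field and the sample records are all hypothetical, and it shows only the repeatable part, not the machine-learning part.

```python
def extract(rows):
    """Drop records with missing values (a stand-in for ETL cleanup)."""
    return [r for r in rows if r.get("amount") is not None]


def enrich(rows):
    """Add a derived field (a stand-in for semantic enrichment)."""
    return [{**r, "is_large": r["amount"] >= 100} for r in rows]


def summarize(rows):
    """Reduce the rows to a small summary (a stand-in for a model)."""
    large = sum(1 for r in rows if r["is_large"])
    return {"rows": len(rows), "large": large}


# The pipeline is data: an ordered list of steps that can be rerun unchanged.
PIPELINE = [extract, enrich, summarize]


def run_pipeline(data, steps=PIPELINE):
    """Feed the output of each step into the next one."""
    for step in steps:
        data = step(data)
    return data


# Made-up inputs for two "problems"; the same pipeline handles both.
january = [{"amount": 50}, {"amount": 150}, {"amount": None}]
february = [{"amount": 200}, {"amount": 300}]

print(run_pipeline(january))
print(run_pipeline(february))
```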
6. The Data Science pipeline currently looks
like…
• From data to insights: this entire process requires
mundane skills (IT), specialized skills (data scientists)
and elements of human psychology to present the
right information at the right time
• The data needs to be discovered, assembled,
semantically enriched and anchored to business
logic; this task can be automated through
machine learning (a set of harmonized tools with AI)
to free up scarce resources
7. The Data Science pipeline currently looks like
(cont’d)…
• Specialized needs are addressed today by open-source
technologies such as R and expensive solutions like
MATLAB and SPSS
• Very few software solutions design their human
interface carefully enough to make the application
usable without customer training (i.e. they are not
"Google easy")
8. The pipeline needs complete rethinking…
• Automate mundane tasks that IT gets tagged with
• Discover data automatically
• Detach business logic from data models
• Make blending public data with corporate data
second nature
• Free up data scientists so that they can build
analytics micro-apps for a domain or a sub-domain
• Data science need not be a niche (or a specialized
category); it should appeal to the masses
(democratizing data and bringing insights to
everyone without requiring specialized skills)
9. Opportunity in Data Science…
• Understand the value chain (IT + Business Analysts +
Data Scientists + Business Users)
• Provide something for everyone: a single integrated
platform (ETL + data integration + predictive modeling
+ in-memory computing + storage) on which data
scientists can build standard analytical apps, moving
away from proprietary models toward standardization
(which also helps IT)
• Analytical apps on this platform (think of them as
rapid deployment solutions) for business users
10. Opportunity in Data Science (cont’d)…
• Help business analysts write basic models (churn,
segmentation, correlation, etc.) without requiring
advanced skills
• Work with consulting companies (like Mu Sigma and
Opera Solutions) so that they can consult and build
apps on your platform for companies that do not have
data scientists on their payroll
• Partner with public data providers (to help clients),
consulting companies (for rapid solutions) and
R/Python/ML communities (to gain mindshare and
show thought leadership)
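To give a sense of how small one of the "basic models" a business analyst might write can be, here is a minimal sketch of a correlation check using only the Python standard library. The `pearson` helper and the ticket/cancellation numbers are made up for illustration; a real platform would presumably wrap this kind of calculation behind a simpler interface.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up example: monthly support tickets vs. cancelled subscriptions.
support_tickets = [1, 2, 3, 4, 5]
cancellations = [2, 4, 6, 8, 10]

r = pearson(support_tickets, cancellations)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 indicates little linear relationship; it says nothing by itself about cause and effect.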