Prolegomena
to Any Future Statistics,
that will be able to
present itself as a
(Data) Science
Carlo Lauro
Emeritus Professor of Statistics
University of Naples Federico II
Is there a Data Science? If yes, then what is
Data Science? And what does Data Science
mean in the “data revolution era”? What about
new professions? What are the challenges
for Statistics?
(LET'S TALK ABOUT DATA SCIENCE)
carlo.lauro@unina.it
carlo.lauro@selfschool.ch
Scientific Meeting
in Memory of Simona Balbi
Naples, February 19, 2019
Director of the Department of
Economic & Managerial Sciences
in Digital Era
SELF HOCHSCHULE, ZUG - CH
• “Data Scientist: The Sexiest Job of the 21st
Century” (T. Davenport & D.J. Patil)
• “Data Scientist: person who is better at
statistics than any software engineer and better at
software engineering than any statistician.” (Josh
Wills, Cloudera)
Is Data Science still a buzzword without a clear
definition?
Is Data Science just a rebranding of Statistics?
‘’Let’s talk about Data Science’’
Data Science and Data Scientists
According to Sir Maurice Kendall, one of the issues on which statisticians do not agree
is the definition of their own science. As a consequence, dictionaries and
encyclopedias do not share a common idea of what Statistics is.
Similar problems arise when analysing the scientific literature on the subject,
as well as the various forums and blogs on social networks, where a
common definition of Data Science is “the science of extracting knowledge from
data”, the same one used for Statistics. As for Statistics, we also observed another
common view: “Data Science is what data scientists do”. So far it is unclear whether Data
Science is a science or a profession. The Data Science Association introduces it as a
profession. Probably Data Science is both. In fact it has the peculiarity of a
‘Methodological Science’ (Tosio Kitagawa): it has no object of its own, but aims to develop a
unified methodology applicable to other categories of sciences.
With the aim of proposing a satisfactory definition for the different people who coexist in
this colorful world of Data Science, we analysed about 150 Data Science and Data
Scientist definitions by a lexical correspondence analysis and a social network analysis (SNA).
But what is even more relevant for us is to try to understand the possible threats and
challenges for Statistics and statisticians that may derive from the current
data revolution, characterized by large amounts of data (big data) of various types
(numeric, ordinal, nominal, symbolic, texts, images, data streams, multi-way, networks,
etc.), coming from disparate sources (surveys, administrative data, social media, sensors,
transactions, open data).
A short history of Data Science (Forbes Magazine, May ’13)
1962 John W. Tukey writes “The Future of Data Analysis”
1974 Peter Naur publishes Concise Survey of Computer Methods in Sweden
1977 The International Association for Statistical Computing (IASC) is established as a
Section of the ISI. “It is the mission of the IASC to link traditional statistical methodology,
modern computer technology, and the knowledge of domain experts in order to convert
data into information and knowledge.”
1989 Gregory Piatetsky-Shapiro organizes and chairs the first Knowledge Discovery in
Databases (KDD) workshop.
1993 J. Chambers presents the concept of learning from data as a challenge as well as
an exciting opportunity for Statistics.
1996 The International Federation of Classification Societies (IFCS) uses the term for the
first time in the title of its conference “Data science, classification, and related methods”.
1996 Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth publish “From
Data Mining to Knowledge Discovery in Databases.”
1997 C.F. Jeff Wu: “Statistics = Data Science?”
2001 William S. Cleveland publishes “Data Science: An Action Plan for Expanding the
Technical Areas of the Field of Statistics.”
2002/2003 Launch of Data Science Journal / Launch of Journal of Data Science
2007 The Research Center for Dataology and Data Science is set up at Fudan University, China.
2010 Mike Loukides writes “What is Data Science?”. Drew Conway proposes the “DS Venn diagram”
2012 Tom Davenport & D.J. Patil, “Data Scientist: The Sexiest Job of the 21st Century”
Tukey 1962:
“…my central interest is in data analysis, which I take to include, among
other things:
Procedures for analysing data, techniques for interpreting the results of
such procedures, ways of planning the gathering of data to make its
analysis easier, more precise or more accurate, and all the machinery and
results of (mathematical) statistics which apply to analysing data…”
Tukey identified four driving forces in the new science:
“Four major influences act on data analysis today:
1. The formal theories of statistics
2. Accelerating developments in computers and display devices
3. The challenge, in many fields, of more and ever larger bodies of data
4. The emphasis on quantification in an ever wider variety of disciplines”
The origin of Data Science: Benzécri’s 5 principles of Data
Analysis
Forbes published "A Very Short History of Data Science", but it may be too short, as it forgets
the fundamental contribution of J.-P. Benzécri in the 1960s. In the book "L'analyse des
données", published by Dunod in 1973, Benzécri set out for the first time the 5 major
principles on which data analysis has to be based.
• The first principle states that "Statistics is not probability; under the name of
(mathematical) statistics a pompous discipline was built, based on theoretical
assumptions that are rarely met in practice."
• The second principle states that "the models should follow the data, not vice versa."
It asserts the priority of the data, that is, a data-driven approach to the extraction of
knowledge.
• The third specifies that "you must simultaneously process the information relating to
the greatest possible number of dimensions, so as to provide a sufficiently complete
representation of the phenomena of interest." This principle seems to anticipate
the role of "big data".
• Finally, the last two principles relate to the basic use of the computer to process the
data: "for the analysis of complex phenomena (facts) the computer is indispensable"
and even "using the computer implies abandoning all the techniques designed
before the advent of computing". This latter principle advocates a change in the paradigm of
classical statistics.
Paradigm | Nature | Form | When
First | Experimental science | Empiricism; describing natural phenomena | pre-Renaissance
Second | Theoretical science | Modelling and generalization | pre-computers
Third | Computational science | Simulation of complex phenomena | pre-big data
Fourth | Exploratory science / Data Science | Data-intensive; statistical exploration and data mining | Now
CHANGE OF PARADIGM IN SCIENCE
By Science (Wikipedia) «we mean a system of knowledge obtained through
organized research activity and with methodical and rigorous procedures (the
scientific method), with the aim of reaching, through tests, a plausible,
objective and predictive description of reality and of the laws that regulate the
occurrence of phenomena».
The data revolution, characterized by large amounts of data (big data) of various
types (numeric, ordinal, nominal, symbolic, texts, images, data streams, multi-way,
networks, etc.), coming from disparate sources (surveys, administrative and official
data, social media, sensors, transactions, open data), offers great opportunities to
enhance knowledge in many key research areas and will bring a strong change in
the paradigm of science.
Data revolution : more and new data
Stream data
Symbolic data
Multi sources data
Text data
High dimensional data
Multimedia data
Network data
Complex data
To be termed scientific, a method of acquiring knowledge is commonly
based on empirical or measurable evidence subject to specific principles of
reasoning.
The Oxford Dictionaries Online defines the scientific method as "a method or
procedure that has characterized natural science since the 17th century",
consisting of:
(1) systematic observation;
(2) formulation of hypotheses;
(3) performing an experiment;
(4) collection and analysis of data to test the hypotheses; if they are rejected, go back
to (2) and refine or alter the hypotheses;
(5) reporting the findings; and
(6) ensuring the reproducibility of results, in order to develop a
theory or take action.
Experiments are an
important tool of the
scientific method.
The best hypotheses lead to
predictions that can be
tested in various ways. The
strongest tests of hypotheses
come from carefully
controlled experiments that
gather empirical data.
Do Data Scientists use the
scientific method?
The Data Science Method
1. Problem Identification
2. Data Collection, Organization, and Definitions
3. Exploratory Data Analysis
4. Pre-processing and Training Data Development
5. Fit Models with Training Data Set
6. Review Model Outcomes: iterate over additional models as needed
7. Identify the Final Model
8. Apply the Model to the Complete Data Set
9. Review the Results: share your findings
10. Finalize Code and Documentation
How to carry out a data science project using a methodological
approach similar to the scientific method, termed the Data Science
Method.
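As an illustration only, steps 4 to 8 of the list above can be sketched in a few lines of Python. Everything here is an assumption made for the example: the toy data, the 70/30 split, the two candidate models (`fit_mean`, `fit_line`) and the mean squared error criterion are not part of the original slides.

```python
import random
import statistics

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training responses."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares line y = a + b * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model."""
    return statistics.fmean((model(x) - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data (stands in for steps 1-3).
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Step 4: develop a training set and a held-out set.
idx = list(range(len(xs)))
rng.shuffle(idx)
train, test = idx[:70], idx[70:]

# Steps 5-6: fit each candidate on the training set, review it on held-out data.
candidates = {"mean": fit_mean, "line": fit_line}
scores = {}
for name, fit in candidates.items():
    model = fit([xs[i] for i in train], [ys[i] for i in train])
    scores[name] = mse(model, [xs[i] for i in test], [ys[i] for i in test])

# Step 7: identify the final model. Step 8: apply it to the complete data set.
best = min(scores, key=scores.get)
final_model = candidates[best](xs, ys)
print(best, scores)
```

The point of the sketch is the loop structure (fit, review, iterate, select, refit), not the specific models.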
The biggest difference between people that are successful as data scientists and those
that are not, is their ability to effectively frame data science projects and
communicate project outcomes.
DATA METHODOLOGIES
Data Science definitions database
Source | Year | Definition
web page | | A field of big data which seeks to provide meaningful information from large amounts of complex data. Data Science combines different fields of work in statistics and computation in order to interpret data for the purpose of decision making
2 academic | 2014 | A major goal of Data Science is to make it easier for others to find and coalesce data with greater ease. Data Science technologies impact how we access data and conduct research across various domains, including the biological sciences, medical informatics, social sciences and the humanities.
2 academic | 2010 | [Ability to] obtain, scrub, explore, model and interpret data, blending hacking, statistics, and machine learning
1 professional | 2010 | An unfortunate, unclear and misleading term that has emerged recently which refers to some subset of activities in the overall knowledge discovery process. What additional descriptive power data science provides beyond data mining and knowledge discovery is unclear.
2 academic | 2017 | Data Science aims to transform data into actionable knowledge to perform predictions as well to support and validate decisions. Computer Science represents the language of the Data Science whereas Statistics is the Logic of the Data Science itself. However, in this process the domain expertise constitutes the catalytic element in the absence of which the transformation cannot be achieved.
2 academic | 2012 | [It] becomes clear pretty quickly that data science has two parents in traditional academia: statistics and computer science.
Data Science through a SNA
A Lexical Correspondence Analysis of 70 DS definitions
1st axis: opposition between Research and Professional DS. 2nd axis: opposition among domain Data Sciences
A typology in 4 clusters: Epistemology DS, Methodology DS, Social DS, Business DS
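To make the starting point of a lexical correspondence analysis concrete, the sketch below builds a small lexical table (definitions by words) and computes the chi-square distance between row profiles, which is the metric correspondence analysis relies on. It is not the analysis behind the slides: the three text snippets are hypothetical, there is no lemmatization, and the factorial axes (which would require a singular value decomposition) are not computed.

```python
from collections import Counter
import math

# Hypothetical toy "definitions"; the real study used about 70 of them.
definitions = {
    "academic": "data science extracts knowledge from data using statistics",
    "professional": "data science builds data products for business decisions",
    "methodology": "data science combines statistics computing and knowledge",
}

# Lexical table: one row per definition, one column per word.
counts = {name: Counter(text.split()) for name, text in definitions.items()}
vocab = sorted(set().union(*counts.values()))
table = {name: [c[w] for w in vocab] for name, c in counts.items()}

# Column masses: share of each word in the whole corpus.
total = sum(sum(row) for row in table.values())
col_mass = [sum(table[name][j] for name in table) / total for j in range(len(vocab))]

def profile(row):
    """Row profile: relative word frequencies within one definition."""
    s = sum(row)
    return [v / s for v in row]

def chi2_distance(a, b):
    """Chi-square distance between the row profiles of definitions a and b."""
    pa, pb = profile(table[a]), profile(table[b])
    return math.sqrt(sum((x - y) ** 2 / m for x, y, m in zip(pa, pb, col_mass)))

for a, b in [("academic", "professional"), ("academic", "methodology")]:
    print(a, b, round(chi2_distance(a, b), 3))
```

Definitions that share rare words end up closer than definitions that only share frequent ones, because each squared difference is divided by the word's overall mass.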
Cluster analysis of Data Science: central definitions
First group: Data Science Epistemology
18 Dataology and Data Science emphasizes on both theories and technologies, more importantly, it studies the laws in datanature not
only ones in nature. It would represent the future direction and have breakthrough in the near future
16 Dataology and Data Science is an umbrella of theories, methods and technologies for studying phenomena and laws of datanature
Second group: Data driven (Social) Data Science
3 Whether... data is search terms, voice samples, or product reviews,... users are in a feedback loop in which they contribute to the
products they use. That's the beginning of Data Science.
46 Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to
extract knowledge or insights from data in various forms, either structured or unstructured
Third group: Business Data Science
21 So far the main goal of Data Science is to provide a statistical framework for studying the problem of gaining knowledge, making
predictions, making decisions or constructing models for specific domains.
20 It may be helpful to think of Data Science and business intelligence as being on two ends of the same spectrum, with business
intelligence focused on managing and reporting existing business data in order to monitor or manage various concerns within the enterprise.
In contrast, Data Science applies advanced analytical tools and algorithms to generate predictive insights and new product innovations that
are a direct result of the data
29 Data Science aims to transform data into actionable knowledge to perform predictions as well to support and validate decisions.
Computer Science represents the language of the Data Science whereas Statistics is the Logic of the Data Science itself. However, in this
process the domain expertise constitutes the catalytic element in the absence of which the transformation cannot be achieved.
Fourth group: Data Science Methodology
22 Data Science incorporates varying elements and builds on techniques and theories from many fields, including mathematics,
statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing,
and high performance computing with the goal of extracting meaning from data and creating data products.
49 “Data science” is the general analysis of the creation of data. This means the comprehensive understanding of where data comes
from, what data represents, and how to turn data into actionable information (something upon which we can base decisions). This
encompasses statistics, hypothesis testing, predictive modeling, and understanding the effects of performing computations on data, among
other things. Science in general has been armed with many of these tools, but data science pools the necessary tools together to provide a
scientific discipline to the analysis and productizing of data.
Summarizing, Data Science is ...
Data Science is an interdisciplinary approach, based mainly on the methods of
Computational Science and Statistics, suitably supplemented by the Knowledge of the
different domains, to meet the new challenges posed by the Information Society.
Computational Science represents the language of Data Science, whereas Statistics is
its logic. The Knowledge of the various domains of interest
constitutes the prerequisite of a Data Science. Thus, from this point of view, it would be
preferable to speak about DATA SCIENCES.
Data Sciences adopt and/or develop appropriate methodologies for purposes of
knowledge discovery, forecasting and decision-making in the face of an increasingly
complex reality, often characterized by large amounts of data (big data) of various types
(numeric, ordinal, nominal, symbolic data, texts, images, data streams, multi-way data,
networks, etc.), coming from disparate sources (surveys, official data, social media, sensors,
transactions, open data).
The main novelty in the Data Sciences lies in the role of KNOWLEDGE. Its
encoding in the form of logical rules or hierarchies, graphs, metadata and ontologies will
offer a new and more effective perspective on data analysis and the interpretation of
results, if properly integrated in the methods of a Data Science. It is in this sense that
Data Science is a discipline whose methods result from the intersection of Statistics,
Computer Science and a Knowledge Domain, and whose purpose is to give meaning to
the data. Alternatively, a Data Science can be defined as Knowledge-based
Computational Statistics, or “Intelligent” Computational/Statistical Data Analysis.
Data Science = Knowledge based or ‘Intelligent’ Computational Statistics
= ‘Intelligent’ Computational or Statistical Data Analysis
Some CS tools:
Data extraction and preparation; Data
Warehousing; Optimization and numerical
algorithms; Simulation; High Performance
Computing; R; Hadoop; Python; SAS;
RapidMiner; Tableau; Visualization; Data Mining;
AI; ANN; Machine Learning; ...
Some Stat tools:
Exploratory methods ; Density
estimation; Regression; Time
series; Causal Models and SEM;
Bayesian models; Factorial analysis and
PCA; Cluster analysis; Classification;
SNA ……
Some Knowledge representation
tools:
Logical rules; Hierarchical rules;
Probability models; Graphs;
Network; Metadata; Ontologies….
The Data Science curvilinear triangle
a DS definition by Carlo Lauro
The Data Science adopts and/or develops appropriate methodologies for purposes of knowledge discovery, prediction and decision-making
in the face of an increasingly complex reality often characterized by large amounts of data (big data) of various types (numeric,
ordinal, nominal, symbolic, texts, images, data streams, multi-way, networks, etc.), coming from disparate sources (surveys, official data,
social media, sensors, transactions, open data, etc.)
The role of Knowledge in DS
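[Figure: the Data Science curvilinear triangle. Vertices: STATISTICS, COMPUTATIONAL SCIENCE, KNOWLEDGE DOMAIN, with DS at the centre. Side labels: Computational Statistics; Statistical Data Analysis (SDA -> Data = Model + Error); Computational Data Analysis (CDA -> Data = Algorithm + Accuracy). Reference: the 2 cultures, Breiman.]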
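The two readings in the figure, "Data = Model + Error" and "Data = Algorithm + Accuracy", can be illustrated with a small sketch. The example is only an assumption of how the contrast might look in practice: the simulated data, the straight-line model and the 5-nearest-neighbour predictor (`knn_predict`) are chosen for illustration and do not come from the slides.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the toy data are reproducible
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 0.5) for x in xs]
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

# SDA view: Data = Model + Error.
# Fit an explicit model, then study the part of the data it leaves unexplained.
mean_x, mean_y = statistics.fmean(train_x), statistics.fmean(train_y)
sxx = sum((x - mean_x) ** 2 for x in train_x)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in train_x]
residuals = [y - f for y, f in zip(train_y, fitted)]  # the "Error" term

# CDA view: Data = Algorithm + Accuracy.
# No explicit model; the algorithm is judged only by its predictive accuracy.
def knn_predict(x, k=5):
    """Average the responses of the k training points closest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
    return statistics.fmean(y for _, y in nearest)

knn_rmse = math.sqrt(
    statistics.fmean((knn_predict(x) - y) ** 2 for x, y in zip(test_x, test_y))
)
print("slope:", round(slope, 3), "held-out RMSE of k-NN:", round(knn_rmse, 3))
```

In the first half the interest is in the estimated parameters and the residuals; in the second half only the held-out error is inspected.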
Data Science (DS) is an interdisciplinary approach to meet the
challenges of the Information Society, based on the methods of
Computational Science and Statistics supplemented by
Knowledge of the different domains. Computational Science
represents the language of the Data Science, whereas
Statistics is its logic. The Knowledge of various domains of
interest constitutes the prerequisite of a Data Science.
Computational science (also scientific computing ) is a rapidly growing multidisciplinary field
that uses advanced computing capabilities to understand and solve complex problems. It is
an area of science which spans many disciplines, but at its core it involves the development
of models and simulations to understand natural systems. Computational science is now
commonly considered a third mode of science, complementing and adding
to experimentation/observation and theory. Substantial effort in computational sciences has
been devoted to the development of algorithms (numerical and non-numerical), computer
simulations, their efficient implementation in programming languages, and validation of the
results to solve science, engineering, and humanities problems.
A computational scientist should be capable of:
- recognizing complex problems; adequately conceptualising the system containing these
problems; designing algorithms suitable for studying this system;
- choosing a suitable computing infrastructure (parallel computing / grid computing
/ supercomputers);
- maximising the computational power of the simulation; assessing to what level the
output of the simulation resembles the system, i.e. whether the model is validated; adjusting the
conceptualisation of the system accordingly; repeating the cycle until a suitable level of
validation is obtained.
The computational scientist trusts that the simulation generates adequately realistic results
for the system, under the studied conditions.
Not to be confused with Computer and information science,
which develops and optimizes the advanced system
hardware, software, networking, and data management
components needed to solve computationally demanding problems.
Data Scientist vs Statistician on Google citations
[Chart: series for “Data Scientist” and “Statistician”]
Data Scientist:
ID | AUTHOR | DEFINITION
1 | DJ Patil | A data scientist is that unique blend of skills that can both unlock the insights of data and tell a fantastic story via the data
2 | Mike Loukides | Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others
3 | Jake Porway | A data scientist is a rare hybrid, a computer scientist with the programming abilities to build software to scrape, combine, and manage data from a variety of sources and a statistician who knows how to derive insights from the information within. She combines the skills to create new prototypes with the creativity and thoroughness to ask and answer the deepest questions about the data and what secrets it holds
4 | Steve Hillion | analytically-minded, statistically and mathematically sophisticated data engineers who can infer insights into business and other complex systems out of large quantities of data
5 | Hillary Mason | A data scientist is someone who blends math, algorithms, and an understanding of human behavior with the ability to hack systems together to get answers to interesting human questions from data
6 | Anjul Bhambhri | A data scientist is part digital trendspotter and part storyteller stitching various pieces of information together
7 | Malcolm Chisholm | A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to the organization
8 | Pat Hanrahan | The definition of "data scientist" could be broadened to cover almost everyone who works with data in an organization. At the most basic level, you are a data scientist if you have the analytical skills and the tools to 'get' data, manipulate it and make decisions with it
9 | Monica Rogati | By definition all scientists are data scientists. In my opinion, they are half hacker, half analyst, they use data to build products and find insights. It's Columbus meets Columbo: starry-eyed explorers and skeptical detectives
A data scientist is someone who can obtain, scrub, explore, model and interpret
Data Scientist definitions database
Data Scientists through a SNA
A Lexical Correspondence Analysis of 80 Data
Scientist definitions
Professional Data Scientists | Researcher Data Scientists
1st axis: opposition between Researcher and Professional Data Scientists
A typology of lemmas in 4 groups makes it possible to identify different profiles of data scientist
Data Scientists, Clusters 1 & 2 : central definitions
CLUSTER | DEFINITION | V.TEST
“ANALYZING DATA FOR KNOWLEDGE” | A data scientist basically needs to understand the data, extract information and create meaningful data products out of it. There are various technicalities involved in a data and despite software and hardware constraints, a scientist with all his expertise and knowledge has to crack the most complex data problems. Billions of people around the globe interact and utilize social media platforms. But have you ever wondered how so many accounts and the data are stored and kept secured? Ever wondered how many accounts have been left underutilized or unused? This is where the data scientist comes in and uses his skills of getting an insight to the data, understand theories and begin applying them. In this scenario, understanding the domain expertise becomes very crucial (Patrao N.) | 3.49
“SKILLS FOR WORKING WITH (BIG) DATA” | Data Scientist is a job title for an employee who excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge. A data scientist possesses a combination of analytic, machine learning, data mining and statistical skills as well as experience with algorithms and coding (Ramakrishna N) | 5.92
SEMANTIC AREA: Researcher Professional
Data Scientists, Clusters 3 & 4 : central definitions
“DEALING WITH NEW METHODOLOGICAL ISSUES” | Perform and interpret data studies and product experiments concerning new data sources or new uses for existing data sources. Develop prototypes, proof of concepts, algorithms, predictive models, and custom analysis. Design and build new data set processes for modeling, data mining, and production purposes. Determine new ways to improve data and search quality, and predictive capabilities (Castillo M.) | 5.14
“IT’S A NEW JOB” | A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge. Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization (Ventura E.) | 7.32
SEMANTIC AREA: Researcher Professional
What do Data Scientists do?
From the point of view of the labour market,
more Data Scientist job titles are appearing.
Some of the prominent ones are:
• Statistician
• Data Scientist
• Data Analyst
• Business Analyst
• Business Intelligence Manager
• Data/Analytics Manager
• Data Engineer
• Data Architect
• Data Administrator
DATA JOBS
Data job trends
Data Analyst. A Data Analyst interprets data to get actionable insights for
the company. With a strong background in statistics and the ability to convert data
from a raw form to a different format (data munging), the Data Analyst collects and
processes structured data and applies statistical algorithms to it.
• Responsibilities: Data collection and processing, programming, machine learning,
data munging, data visualization, applying statistical analysis
• Languages: R, Python, SQL, NoSQL, HTML, JavaScript, C/C++, SPSS
Data Scientist. A Data Scientist’s mission is similar to a Data Analyst’s: to find
actionable insights that are key to a company’s growth and decision-making.
However, a Data Scientist is needed when big data require more
robust skills for sorting through a lot of unstructured data to identify questions and
pull out critical information. The Data Scientist then cleanses the data for proper analysis
and creates new algorithms to run queries that relate data from disparate sources.
On top of these skills, a Data Scientist also needs strong storytelling and
visualization skills to share insights with peers across the company.
• Responsibilities: Identifying questions, running queries, data cleansing and
processing, predictive modeling, machine learning, applying statistical analysis,
correlating disparate data, storytelling and visualization
• Languages: R, Python, SAS, Hive, MATLAB, SQL, Pig, Spark, Hadoop
Data job descriptions
Data Architect. A Data Architect is the go-to person for data management,
especially when dealing with any number of disparate data sources. With an
extensive knowledge of how databases work, as well as how the acquired data
relates to the business’s operations, the Data Architect, ideally, is able to speculate
how changes will affect the company’s data use, then manipulate the data
architecture to compensate for them.
• Responsibilities: Data warehousing, ETL, architecture development, modeling
• Languages: Hive, SQL, Pig, Spark, XML
Data Engineer. This role is closely related to the Data Architect. The Data Engineer
also works on the management side of data, making some people think the titles
are interchangeable. However, a Data Engineer, who usually has a strong
background in software engineering, builds, tests and maintains the data
architecture.
• Responsibilities: ETL, installing data warehousing solutions, data modeling, data
architecture and development, database architecture testing
• Languages: R, Python, SAS, MATLAB, SQL, NoSQL, Pig, Hadoop, Java, C/C++
Data job descriptions
40 techniques used by Data Scientists
Principal Component
Neural Networks
Support Vector Machine
Nearest Neighbors
Feature Selection
(Geo-) Spatial Modeling
Recommendation Engine
Search Engine
Attribution Modeling
Collaborative Filtering
Rule System
Linkage Analysis
Linear Regression
Logistic Regression
Jackknife Regression
Density Estimation
Confidence Interval
Test of Hypotheses
Pattern Recognition
Clustering
Supervised Learning (classification)
Time Series
Decision Trees
Random Numbers
Monte-Carlo Simulation
Bayesian Statistics
Naive Bayes
Association Rules
Scoring Engine
Segmentation
Predictive Modeling
Graphs
Deep Learning
Game Theory
Imputation
Survival Analysis
Arbitrage
Lift Modeling
Yield Optimization
Cross-Validation
Model Fitting
Data science without statistics is possible, even desirable (Vincent Granville, @DSC 2014)!
Statistics is Dead – Long Live Data Science… ( Lee Baker, @DSC 2016)!!!!!
Techniques used by Data Scientists (Source: KDnuggets 2017)
Data analysed by Data Scientists (Source: KDnuggets)
Software used by Data Scientists (Source: KDnuggets)
Largest data analysed by Data Scientists (Source: KDnuggets)
How to become a data scientist?
Meeting the need
School... core mathematics
BSc... continue to focus on single disciplines, especially mathematics (including probability) and computing
MSc... increase focus on statistics, begin to develop interdisciplinarity, but beware of “cut-and-paste data science” curricula
PhD... encourage interdisciplinary and team-based projects
PostDoc... focus on training fellowships, to include migrants from other disciplines
(Peter J Diggle, 2015)
Suggestions for a MSc in Business Data Science
Data Science challenges for Statistics
According to a recent poll by KDnuggets, the large majority (68%) of the audience
thought that in the Era of Big Data, Statistics will become more important, as the
foundation of Data Science.
The rise of Data Science could be seen as a potential threat to the long-term status of
the statistics discipline, but there is also a much greater opportunity to
re-emphasize the universal relevance of statistical thinking to the interpretation
and exploitation of data, by improving links between statistics and information
technology, and also with those communities characterized by new and big data.
We hope that statisticians will be able to take this opportunity by developing new
methods from a knowledge domain perspective, i.e.
• Knowledge-based Computational Statistics
• Statistical & Algorithmic intelligent data analysis
contributing as well to the Data Science needs of the different scientific and
professional domains involving new and big data.
The cooperation between statisticians and computer scientists in the data revolution
era will make it possible to properly address data management and preparation problems
(data extraction, data and source integration, data cleaning and validation, knowledge
coding). This task accounts for more than 70% of the whole data processing effort. It has a
strong impact on data quality and consequently on data science results and
actionable knowledge.
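A minimal sketch of what such data preparation might look like on a toy scale is given below. The two sources (`raw_survey`, `raw_admin`), the field names and the validation rule (age between 0 and 120) are hypothetical and chosen only to show cleaning, validation and source integration in sequence.

```python
# Hypothetical raw inputs: a survey extract and an administrative register.
raw_survey = [
    {"id": "1", "age": "34", "city": " Naples "},
    {"id": "2", "age": "", "city": "zug"},
    {"id": "2", "age": "", "city": "zug"},   # duplicated record
    {"id": "3", "age": "-5", "city": "Rome"},  # impossible value
]
raw_admin = {"1": {"income": "28000"}, "3": {"income": "41000"}}

def clean(record):
    """Data cleaning: convert types, trim whitespace, normalise case."""
    age = int(record["age"]) if record["age"].strip() else None
    return {"id": int(record["id"]), "age": age, "city": record["city"].strip().title()}

def is_valid(record):
    """Data validation: a missing age is allowed, an impossible one is not."""
    return record["age"] is None or 0 <= record["age"] <= 120

seen, prepared, rejected = set(), [], []
for rec in map(clean, raw_survey):
    if rec["id"] in seen:  # drop duplicates
        continue
    seen.add(rec["id"])
    if not is_valid(rec):
        rejected.append(rec)
        continue
    # Source integration: attach the administrative income when available.
    admin = raw_admin.get(str(rec["id"]), {})
    rec["income"] = int(admin["income"]) if "income" in admin else None
    prepared.append(rec)

print(prepared)
print(rejected)
```

Even in this tiny example most of the code deals with preparation rather than analysis, which is the point made in the text.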
“Big data” is everywhere. The term was added to the Oxford English Dictionary in
2013. Now Gartner’s just-released 2017 Hype Cycle shows “big data”
passing the “peak of inflated expectations” and moving on its way
down into the “trough of disillusionment.” Big data is all the rage.
But what does it actually mean?
We analysed more than 45 definitions registered on a blog at Berkeley.
A commonly repeated definition cites the three Vs: volume, velocity, and
variety. But others argue that it’s not the size of data that counts, but the
tools being used or the insights that can be drawn from a dataset.
About Big Data ...
A Lexical Correspondence Analysis of 45 Big Data definitions
1st axis: opposition between academic authors and professional
data scientists
A typology of the lemmas in 4 groups makes it possible to identify different
profiles among the definitions (Academics, Influencers, DS Managers,
DS Professionals)
Big Data Definitions – 4 class Typology – The central definitions
• The first group (which contains 50% of the lemmas) concerns definitions that
aim to identify the characterizing traits of the concept of big data, and
therefore terms such as "complex", "dataset", "large" and related concepts such as
"analysis" and "technique". The definitions in this group are, in a sense,
mainstream: the key words usually used to describe the phenomenon abound in
them. Among the definitions that represent this group
is: "As computational efficiency continues to increase, 'big data' will be less
about the actual size of a particular dataset and more about the specific
expertise needed to process it. 'Big data' will ultimately describe any datasets
large enough to need high-level programming skills and statistically defensible
methodologies in order to transform the data asset into something of value."
• The second group contains those definitions that try to contextualize the
phenomenon of big data. In this group we find many concepts related to the
temporal dimension, such as "time", "now" and "new". Among the most
representative of this group we find: "Big data, which started as a
technological innovation in distributed computing, is now a cultural
movement by which we continue to discover how humanity interacts with the
world - and each other - at large-scale."
Big Data Definitions – 4 class Typology – The central definitions
• In the third group, instead, we find the definitions that also offer
extra-economic perspectives on big data, reflecting on how big data could
be useful to humanity as a whole and not only in an economic sense. In this
group we find concepts such as "world", "people" and "possibility". Among the
most representative definitions of this group we find: Big data is an
umbrella term that means a lot of different things, but to me, it means the
possibility of doing extraordinary things using modern machine learning
techniques on digital data. Whether it is predicting illness, the weather, the
spread of infectious diseases, or what you will buy next, it offers a world of
possibilities for improving people’s lives.
• Finally, the last group contains very technical definitions that also aim at
promoting big data in an economic sense, such as the following: [Big data
means] harnessing more sources of diverse data where “data variety” and
“data velocity” are the key opportunities. (Each source represents “a signal”
on what is happening in the business.) The opportunity is to harness data
variety [and] automate “harmonization” of data sources to deliver
fast-updating insights consumable by the line-of-business users.
1st class (31). AnnaLee Saxenian, Dean, UC Berkeley School of Information (Academic)
I’m not fond of the phrase “big data” because it focuses on the volume of data, obscuring the far-reaching changes
that are making data essential to individuals and organizations in today’s world. But if I have to define it I’d say that “big
data” is data that can’t be processed using standard databases because it is too big, too fast-moving, or too complex
for traditional data processing tools.
2nd class (28). Gregory Piatetsky-Shapiro, President and Editor, KDnuggets.com (Influencer)
The best definition I saw is, “Data is big when data size becomes part of the problem.” However, this refers to the
size only. Now the buzzword “big data” refers to the new data-driven paradigm of business, science and technology,
where the huge data size and scope enables better and new services, products, and platforms. #BigData also
generates a lot of hype and will probably be replaced by a new buzzword, like “Internet of Things,” but “big data”-
enabled services companies, like Google, Facebook,
3rd class (5). Mike Cavaretta, Data Scientist Consultant (DS Consultant)
You cannot give me too much data. I see big data as storytelling — whether it is through information
graphics or other visual aids that explain it in a way that allows others to understand across sectors. I
always push for the full scope of the data over averages and aggregations — and I like to go to the
raw data because of the possibilities of things you can do with it.
4th class (22). Sharmila Mulligan, CEO and Founder, ClearStory Data (DS Manager)
[Big data means] harnessing more sources of diverse data where “data variety” and “data velocity”
are the key opportunities. (Each source represents “a signal” on what is happening in the business.)
The opportunity is to harness data variety [and] automate “harmonization” of data sources to deliver
fast-updating insights consumable by the line-of-business users.
Big Data Definitions – 4 class Typology – The central definitions
Big Data challenges for Statistics
By their nature, big data problems usually require multidisciplinary teams. They typically
require domain experts, computational experts, machine learning experts, data
miners and statisticians.
• In particular, statisticians help translate the scientific question into a statistical
question, which includes carefully describing the data structure; the underlying system that
generated the data (the model); and what we are trying to assess (the parameters we wish
to estimate) or predict.
What does Statistics bring to Big Data and where are the opportunities?
• Statistics is fundamental to ensuring that meaningful, accurate information is extracted from
Big Data, especially for the following:
o Data quality;
o Missing and incomplete data;
o Quantification of the uncertainty of predictions, forecasts and models.
Statisticians are skilled at validating and correcting for bias; measuring uncertainty;
designing studies and sampling strategies; assessing and certifying data quality;
enumerating the limitations of studies; dealing with issues such as missing data and other
sources of non-sampling error; developing models for the analysis of complex data
structures; creating methods for causal inference and comparative effectiveness;
eliminating redundant and uninformative variables; integrating data from multiple sources.
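As an illustration of what "measuring uncertainty" can look like in practice, the following is a minimal sketch of a percentile bootstrap interval for a mean. It is not taken from the slides: the function name, the toy sample, the number of resamples and the seed are all assumptions chosen for illustration, and the bootstrap is only one of several possible techniques.

```python
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`.

    Hypothetical helper written for this illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo = means[int(alpha * n_resamples)]
    hi = means[int((1.0 - alpha) * n_resamples) - 1]
    return lo, hi


# Toy sample, invented for illustration.
sample = [12.1, 9.8, 11.4, 10.9, 13.2, 8.7, 10.1, 12.5, 11.0, 9.5]
low, high = bootstrap_mean_interval(sample)
print(f"mean = {statistics.fmean(sample):.2f}, interval = ({low:.2f}, {high:.2f})")
```

The interval, rather than the point estimate alone, is what conveys how much the result could change with another sample of the same size.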
Data Engineer: Big Data?
Data Scientist: No thanks! In order to conduct my business I need Informative Data.
[Diagram labels: Big Data; Informative Data (ID); Information; ID PROCESSING; Knowledge; DECISION]
The most important thing about data is not its
size but its informative content
Some knowledge representation tools:
intervals; histograms; logical rules; hierarchical rules;
probability models;
graphs; networks; metadata; ontologies ...
Theory without data is blind. Data without knowledge is lame.
A useful approach to ID
PROCESSING:
SYMBOLIC DATA ANALYSIS
Big Data Challenges for Data Scientists/Statisticians
Informative Data as Symbolic Data Table
SDT fig. in: E. Diday, Thinking by classes
in data science: The symbolic data
analysis paradigm. WIREs Computational Statistics, Vol. 8,
Sept./Oct. 2016
Symbolic Data Analysis tools, such as descriptive statistics, PCA, regression, decision trees and clustering, have been
developed in order to analyze and discover new knowledge from data.
«A SDT quality can be measured in terms of explanatory and discriminatory power of its symbolic features»
A SDT offers a representation of the variability we find in Big Data
F. Brambilla: «Statistics is the Science that studies the variability of phenomena»
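The idea of a symbolic data table can be sketched in a few lines: individual records are grouped into classes, and each class is described by an interval for a numeric variable and by a relative-frequency histogram for a categorical one, so that the within-class variability is retained. This is a minimal sketch under stated assumptions, not the procedure of the cited paper: the function `symbolic_table`, the variable names and the toy records are all hypothetical.

```python
from collections import defaultdict


def symbolic_table(records, class_key, numeric_vars, categorical_vars):
    """Aggregate individual records into one symbolic description per class.

    Numeric variables become intervals (min, max); categorical variables
    become relative-frequency histograms. Hypothetical helper for illustration.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[class_key]].append(rec)

    table = {}
    for cls, rows in groups.items():
        desc = {}
        for var in numeric_vars:
            vals = [r[var] for r in rows]
            desc[var] = (min(vals), max(vals))  # interval-valued cell
        for var in categorical_vars:
            counts = defaultdict(int)
            for r in rows:
                counts[r[var]] += 1
            # histogram-valued cell: category -> relative frequency
            desc[var] = {k: c / len(rows) for k, c in counts.items()}
        table[cls] = desc
    return table


# Toy records, invented for illustration only.
records = [
    {"region": "North", "income": 30, "channel": "web"},
    {"region": "North", "income": 45, "channel": "shop"},
    {"region": "North", "income": 38, "channel": "web"},
    {"region": "South", "income": 22, "channel": "shop"},
    {"region": "South", "income": 29, "channel": "shop"},
]

sdt = symbolic_table(records, "region", ["income"], ["channel"])
for cls, desc in sdt.items():
    print(cls, desc)
```

Each row of the resulting table is a class rather than an individual, which is how a large set of raw records can be reduced to a much smaller set of informative descriptions.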
Knowledge Discovery is a sequential learning process
Supervised statistical methods allow investigators to produce new knowledge!
Knowledge encoding & data integration
Knowledge Pyramid
Toward a Knowledge Based Data Science
Conclusion: toward a Knowledge Based Data Science
Data Science is an interdisciplinary approach to meet the new challenges of the
Information Society. It is based mainly on the methods of Statistics and
Computational Science, suitably supplemented by the Knowledge of the different
domains.
Computational Science represents the language of Data Science, whereas
Statistics is the Logic of Data Science itself. The Knowledge of the various
domains of interest constitutes the prerequisite of a Data Science. Thus, from this
point of view, it would be preferable to speak about DATA SCIENCES.
The main novelty in the Data Sciences lies in the role of KNOWLEDGE.
Its encoding in a proper way (intervals, histograms, functions, logical rules or
hierarchies, graphs, metadata, ontologies, etc.) can be used in the different steps
of a Data Science exercise:
- in automating the step of (Big) Data cleaning and refinement (feature selection);
- to obtain a new (Big) Data representation in terms of Informative Data;
- to drive data processing methods in the right/expected direction, avoiding
trivial results;
- to allow coherent interpretation of results and enrich storytelling;
- to make suitable decisions.
For these reasons I like to call such an approach a
Knowledge based Data Science
Prolegomena
to Any Future Statistics,
that will be able to present itself
as a (Data) Science
Carlo Lauro
Emeritus Professor of Statistics
University of Naples Federico II
THANK YOU FOR
YOUR ATTENTION!!
carlo.lauro@unina.it
carlo.lauro@selfschool.ch
Scientific Meeting
in Memory of Simona Balbi
Naples, February 19° , 2019
Director of the Department of
Economic & Managerial Sciences
in Digital Era
SELF HOCHSCHULE, ZUG - CH

Let's talk about Data Science

  • 1. Prolegomena to Any Future Statistics, that will be able to present itself as a (Data) Science Carlo Lauro Emeritus Professor of Statistics University of Naples Federico II Is there a Data Science? If yes, then what is Data Science? And what does Data Science mean in “data revolution era”? What about new professions? What are the challenges for Statistics? (LET'S TALK ABOUT DATA SCIENCE) carlo.lauro@unina.it carlo.lauro@selfschool.ch Scientific Meeting in Memory of Simona Balbi Naples, February 19° , 2019 Director of the Department of Economic & Managerial Sciences in Digital Era SELF HOCHSCHULE, ZUG - CH
  • 2. • “Data Science: The Sexiest Job of the 21st Century” (T. Davenport & D.J. Patil) • “Data Scientist : Person who is better at statistics than any software engineer and better at software engineering than any statistician.” (Josh Wills, Cloudera ) Is Data Science still a buzzword without a clear definition? Is Data Science just a rebranding of Statistics? ‘’Let’s talk about Data Science’’ Data Science and Data Scientists
  • 3. ‘’Let’s talk about Data Science’’ According with Sir Maurice Kendall, among the issues the statisticians do not agree, there is the definition of their science. As a consequence, dictionaries and encyclopedias, do not share a common idea on what Statistics is. Similar problems seem to happen analysing the scientific literature on the subject matter as well as the various forum and blogs present in social networks where a common definition for Data Science is The Science of extraction the knowledge from the Data the same one used in Statistics. As for Statistics, we observed also another a common view , ‘’Data Science is what Data scientists do ‘’. So far is unclear if a Data science is a science or a profession? The Data Science Association introduce itself as a profession. Probably a Data Science is both. In fact it has the peculiarty of a ‘Methodological Science’ (Tosio Kitagawa) with no object but its object is to develop a unified methodology applicable to other categories of sciences. With the aim to propose a satisfactory definition to the different people that coexist in this colorful world of the Data science we analysed about 150 Data Science and Data scientist definitions by a lessical corrispondence analysis and a SNA. But what is also more relevant for us is to try understand eventual threats and challenges that can derive for Statistics and statisticians as consequence of the actual data revolution characterized by large amounts of data (big data) of various types (numeric, ordinal, nominal, symbolic, texts, images, data streams, multi-way, networks ,etc.), coming from disparate sources (surveys, administrative data,social media, sensors, transactions, open data).
  • 4.
  • 5. ‘’Let’s talk about Data science ‘’ A short history of Data Science (Forbes Magazine, May ’13) 1962 John W. Tukey writes “The Future of Data Analysis” 1974 Peter Naur publishes Concise Survey of Computer Methods in Sweden 1977 The International Association for Statistical Computing (IASC) is established as a Section of the ISI. “It is the mission of the IASC to link traditional statistical methodology, modern computer technology, and the knowledge of domain experts in order to convert data into information and knowledge.” 1989 Gregory Piatetsky-Shapiro organizes and chairs the first Knowledge Discovery in Databases (KDD) workshop. 1993 J. Chambers presents the concept of learning from data as a challenges as well as exciting opportunities for Statistics. 1996 The International Federation of Classification Societies (IFCS) for the first time, uses the term in the conference “Data science, classification, and related methods”. 1996 Usama Fayyad, Gregory Piatetsky- Shapiro, and Padhraic Smyth publish “From Data Mining to Knowledge Discovery in Databases.” 1997C.F. Jeff Wu : “Statistics = Data Science?” 2001 William S. Cleveland publishes “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.” 2002/2003 Launch of Data Science Journal / Launch of Journal of Data Science 2007 The Research Center for Dataology and Data Science is set at Fudan University, China. 2010 Mike Loukides writes in “What is Data Science”. Drew Conway “DS Venn diagram” 2012 Tom Davenport & D.J Patil, “Data Scientist: The Sexiest Job of the 21st Century”
  • 6. Tukey 1962: “…my central interest is in data analysis, which I take to include, among other things: Procedures for analysing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analysing data…” Tukey identified four driving forces in the new science: “Four major influences act on data analysis today: 1. The formal theories of statistics 2. Accelerating developments in computers and display devices 3. The challenge, in many fields, of more and ever larger bodies of data 4. The emphasis on quantification in an ever wider variety of disciplines” ‘’Let’s talk about Data science ‘’
  • 7. The origin of Data Science: Benzécri’s 5 principles of Data Analysis. Forbes published "A Very Short History of Data Science", but it may be too short, as it omits the fundamental contribution of J.-P. Benzécri in the 1960s. In the book "L'analyse des données", published by Dunod in 1973, Benzécri set out for the first time the 5 major principles on which data analysis has to be based.
    • The first principle states that "Statistics is not probability; under the name of (mathematical) statistics, a pompous discipline was built on theoretical assumptions that are rarely met in practice."
    • The second principle states that "the models should follow the data, not vice versa." This asserts the priority of the data, that is, a data-driven approach to the extraction of knowledge.
    • The third specifies that "you must simultaneously process the information relating to the greatest possible number of dimensions, so as to provide a sufficiently complete representation of the phenomena of interest." This principle seems to anticipate the role of "big data".
    • Finally, the last two principles concern the use of the computer to process the data: "for the analysis of complex phenomena (facts) the computer is indispensable" and even "using the computer implies abandoning all the techniques designed before the advent of computing". This latter principle advocates a change in the paradigm of classical statistics.
  • 8. CHANGE OF PARADIGM IN SCIENCE
    Paradigm | Nature | Form | When
    First | Experimental science | Empiricism; describing natural phenomena | pre-Renaissance
    Second | Theoretical science | Modelling and generalization | pre-computers
    Third | Computational science | Simulation of complex phenomena | pre-big data
    Fourth | Exploratory science / Data Science | Data-intensive; statistical exploration and data mining | Now
    By Science (Wikipedia) «we mean a system of knowledge obtained through an organized research activity and with methodical and rigorous procedures (the scientific method), with the aim of reaching, through tests, a plausible, objective and predictive description of reality and of the laws that regulate the occurrence of phenomena». The data revolution, characterized by large amounts of data (big data) of various types (numeric, ordinal, nominal, symbolic, texts, images, data streams, multi-way, networks, etc.) coming from disparate sources (surveys, administrative and official data, social media, sensors, transactions, open data), offers great opportunities to enhance knowledge in many key research areas, and this will bring a strong change in the paradigm of science.
  • 9. Data revolution: more and new data. Stream data; symbolic data; multi-source data; text data; high-dimensional data; multimedia data; network data; complex data.
  • 10. To be termed scientific, a method of acquiring knowledge is commonly based on empirical or measurable evidence subject to specific principles of reasoning. The Oxford Dictionaries Online defines the scientific method as "a method or procedure that has characterized natural science since the 17th century", consisting of: (1) systematic observation; (2) formulation of hypotheses; (3) performing an experiment; (4) collection and analysis of data to test the hypotheses; if they are rejected, go back to (2) and refine or alter the hypotheses; (5) reporting findings; and (6) ensuring the reproducibility of results, in order to develop a theory or take action. Experiments are an important tool of the scientific method. The best hypotheses lead to predictions that can be tested in various ways. The strongest tests of hypotheses come from carefully controlled experiments that gather empirical data. Do Data Scientists use the scientific method?
  • 11. The Data Science Method. A data science project can be carried out with a methodological approach similar to the scientific method, coined the Data Science Method:
    1. Problem Identification
    2. Data Collection, Organization, and Definitions
    3. Exploratory Data Analysis
    4. Pre-processing and Training Data Development
    5. Fit Models with Training Data Set
    6. Review Model Outcomes: iterate over additional models as needed
    7. Identify the Final Model
    8. Apply the Model to the Complete Data Set
    9. Review the Results: share your findings
    10. Finalize Code and Documentation
    The biggest difference between people who are successful as data scientists and those who are not is their ability to frame data science projects effectively and to communicate project outcomes.
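The ten steps above can be sketched in a few lines of code. This is only an illustrative sketch under stated assumptions: the data are simulated (y = 3 + 2x + noise), the two candidate models (a constant mean and a least-squares line) and all function names are hypothetical, and only the standard library is used.

```python
import random


def make_data(n=200, seed=0):
    """Step 2 (data collection): simulated here as y = 3 + 2x + noise (assumption)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3 + 2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_mean(xs, ys):
    """Candidate model 1: always predict the mean of y."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate model 2: simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def run_method(seed=0):
    xs, ys = make_data(seed=seed)                      # step 2
    summary = {"n": len(xs), "mean_y": sum(ys) / len(ys)}  # step 3 (minimal EDA)
    cut = int(0.8 * len(xs))                           # step 4: train/test split
    train, test = (xs[:cut], ys[:cut]), (xs[cut:], ys[cut:])
    candidates = {"mean": fit_mean, "line": fit_line}
    # Steps 5-6: fit each candidate on training data, review on held-out data.
    scores = {name: mse(fit(*train), *test) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)                 # step 7: final model
    final_model = candidates[best](xs, ys)             # step 8: refit on all data
    return summary, scores, best, final_model          # step 9: results to share


if __name__ == "__main__":
    summary, scores, best, model = run_method()
    print(summary, scores, best, round(model(5.0), 2))
```

Steps 1 and 10 (problem identification, code and documentation) are not computational and are left out of the sketch.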
  • 13. Data Science definitions database (fields: definition, web page, category, year)
    - A field of big data which seeks to provide meaningful information from large amounts of complex data. Data Science combines different fields of work in statistics and computation in order to interpret data for the purpose of decision making. (2, academic, 2014)
    - A major goal of Data Science is to make it easier for others to find and coalesce data with greater ease. Data Science technologies impact how we access data and conduct research across various domains, including the biological sciences, medical informatics, social sciences and the humanities. (2, academic, 2010)
    - [Ability to] obtain, scrub, explore, model and interpret data, blending hacking, statistics, and machine learning. (1, professional, 2010)
    - An unfortunate, unclear and misleading term that has emerged recently which refers to some subset of activities in the overall knowledge discovery process. What additional descriptive power data science provides beyond data mining and knowledge discovery is unclear. (2, academic, 2017)
    - Data Science aims to transform data into actionable knowledge to perform predictions as well to support and validate decisions. Computer Science represents the language of the Data Science whereas Statistics is the Logic of the Data Science itself. However, in this process the domain expertise constitutes the catalytic element in the absence of which the transformation cannot be achieved. (2, academic, 2012)
    - Data Science becomes clear pretty quickly that data science has two parents in traditional academia: statistics and computer science.(
  • 14. Data Science through an SNA (social network analysis)
  • 15. A lexical correspondence analysis of 70 DS definitions. 1st axis: opposition between Research and Professional DS. 2nd axis: opposition among the domain Data Sciences. A typology in 4 clusters: Epistemology DS, Methodology DS, Social DS, Business DS.
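To give an idea of how an axis opposing two vocabularies arises, here is a minimal sketch of correspondence analysis on a toy lemma-by-source table with only two columns (academic vs professional definitions). With two columns there is a single axis, so the row coordinates have a closed form and no matrix decomposition is needed. The counts and function names are invented for illustration; this is not the analysis of the 70 definitions, which would use a full lemma-by-definition table.

```python
import math


def correspondence_axis(table):
    """Row coordinates on the single CA axis of a two-column table.

    table: {lemma: (count_in_academic_defs, count_in_professional_defs)}
    Every row must have a positive total and both columns must be non-empty.
    """
    n = sum(a + b for a, b in table.values())
    c1 = sum(a for a, _ in table.values()) / n   # mass of column 1
    c2 = 1.0 - c1                                # mass of column 2
    coords, masses = {}, {}
    for lemma, (a, b) in table.items():
        masses[lemma] = (a + b) / n              # row mass
        deviation = a / (a + b) - c1             # row profile minus average profile
        coords[lemma] = deviation / math.sqrt(c1 * c2)  # chi-square metric
    inertia = sum(masses[w] * coords[w] ** 2 for w in table)
    return coords, masses, inertia


def chi_square_over_n(table):
    """Total inertia computed independently as Pearson chi-square divided by n."""
    n = sum(a + b for a, b in table.values())
    col = [sum(r[j] for r in table.values()) / n for j in (0, 1)]
    total = 0.0
    for row in table.values():
        r = sum(row) / n
        for j in (0, 1):
            expected = r * col[j]
            total += (row[j] / n - expected) ** 2 / expected
    return total


if __name__ == "__main__":
    toy = {"theory": (8, 1), "method": (6, 2), "business": (1, 9), "product": (2, 7)}
    coords, masses, inertia = correspondence_axis(toy)
    for lemma in sorted(coords, key=coords.get):
        print(f"{lemma:10s} {coords[lemma]:+.3f}")
    print("inertia:", round(inertia, 4), "chi2/n:", round(chi_square_over_n(toy), 4))
```

Lemmas used mostly in academic definitions fall on the positive side of the axis and those used mostly in professional ones on the negative side; the sign convention is arbitrary.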
  • 16. Cluster analysis of Data Science: central definitions
    First group: Data Science Epistemology
      18. Dataology and Data Science emphasizes on both theories and technologies, more importantly, it studies the laws in datanature not only ones in nature. It would represent the future direction and have breakthrough in the near future.
      16. Dataology and Data Science is an umbrella of theories, methods and technologies for studying phenomena and laws of datanature.
    Second group: Data-driven (Social) Data Science
      3. Whether... data is search terms, voice samples, or product reviews,... users are in a feedback loop in which they contribute to the products they use. That's the beginning of Data Science.
      46. Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.
    Third group: Business Data Science
      21. So far the main goal of Data Science is to provide a statistical framework for studying the problem of gaining knowledge, making predictions, making decisions or constructing models for specific domains.
      20. It may be helpful to think of Data Science and business intelligence as being on two ends of the same spectrum, with business intelligence focused on managing and reporting existing business data in order to monitor or manage various concerns within the enterprise. In contrast, Data Science applies advanced analytical tools and algorithms to generate predictive insights and new product innovations that are a direct result of the data.
      29. Data Science aims to transform data into actionable knowledge to perform predictions as well to support and validate decisions. Computer Science represents the language of the Data Science whereas Statistics is the Logic of the Data Science itself. However, in this process the domain expertise constitutes the catalytic element in the absence of which the transformation cannot be achieved.
    Fourth group: Data Science Methodology
      22. Data Science incorporates varying elements and builds on techniques and theories from many fields, including mathematics, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.
      49. “Data science” is the general analysis of the creation of data. This means the comprehensive understanding of where data comes from, what data represents, and how to turn data into actionable information (something upon which we can base decisions). This encompasses statistics, hypothesis testing, predictive modeling, and understanding the effects of performing computations on data, among other things. Science in general has been armed with many of these tools, but data science pools the necessary tools together to provide a scientific discipline to the analysis and productizing of data.
  • 17. Summarizing: Data Science is… Data Science is an interdisciplinary approach, based mainly on the methods of Computational Science and Statistics, suitably supplemented by the Knowledge of the different domains, to meet the new challenges posed by the Information Society. Computational Science represents the language of Data Science, whereas Statistics is its logic. The Knowledge of the various domains of interest constitutes the prerequisite of a Data Science. Thus, from this point of view, it would be preferable to speak of DATA SCIENCES. Data Sciences adopt and/or develop appropriate methodologies for the purposes of knowledge discovery, forecasting and decision-making in the face of an increasingly complex reality, often characterized by large amounts of data (big data) of various types (numeric, ordinal, nominal, symbolic data, texts, images, data streams, multi-way data, networks, etc.) coming from disparate sources (surveys, official data, social media, sensors, transactions, open data). The main novelty in the Data Sciences is the role played by KNOWLEDGE. Its encoding in the form of logical rules or hierarchies, graphs, metadata and ontologies will offer a new and more effective perspective on data analysis and the interpretation of results, if properly integrated into the methods of a Data Science. It is in this sense that a Data Science is a discipline whose methods result from the intersection of Statistics, Computer Science and a Knowledge Domain, and whose purpose is to give meaning to the data. Alternatively, a Data Science can be defined as Knowledge-based Computational Statistics, or “Intelligent” Computational/Statistical Data Analysis.
  • 18. The Data Science curvilinear triangle: a DS definition by Carlo Lauro
    Data Science = Knowledge-based or ‘Intelligent’ Computational Statistics = ‘Intelligent’ Computational or Statistical Data Analysis
    [Figure: the Data Science curvilinear triangle. Labels: STATISTICS; COMPUTATIONAL SCIENCE; KNOWLEDGE DOMAIN; Computational Statistics; Statistical Data Analysis; Computational Data Analysis; DS]
    The role of Knowledge in DS (the 2 cultures, Breiman): SDA -> Data = Model + Error; CDA -> Data = Algorithm + Accuracy
    Some CS tools: Data extraction and preparation; Data Warehousing; Optimization and numerical algorithms; Simulation; High Performance Computing; R; Hadoop; Python; SAS; RapidMiner; Tableau; Visualization; Data Mining; A.I.; ANN; Machine Learning; …
    Some Stat tools: Exploratory methods; Density estimation; Regression; Time series; Causal Models and SEM; Bayesian models; Factorial analysis and PCA; Cluster analysis; Classification; SNA; …
    Some Knowledge representation tools: Logical rules; Hierarchical rules; Probability models; Graphs; Network; Metadata; Ontologies; …
    Data Science adopts and/or develops appropriate methodologies for the purposes of knowledge discovery, prediction and decision-making in the face of an increasingly complex reality, often characterized by large amounts of data (big data) of various types (numeric, ordinal, nominal, symbolic, texts, images, data streams, multi-way, networks, etc.) coming from disparate sources (surveys, official data, social media, sensors, transactions, open data, etc.).
    Data Science (DS) is an interdisciplinary approach to meet the challenges of the Information Society, based on the methods of Computational Science and Statistics supplemented by Knowledge of the different domains. Computational Science represents the language of Data Science, whereas Statistics is its logic. The Knowledge of various domains of interest constitutes the prerequisite of a Data Science.
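The contrast between "Data = Model + Error" and "Data = Algorithm + Accuracy" can be illustrated on the same simulated data set. In this sketch (all names, the simulated data and the choice of methods are assumptions made for illustration), the statistical view fits a linear model and inspects its residual error, while the algorithmic view runs a k-nearest-neighbours predictor and judges it only by its held-out predictive accuracy, measured here as root mean squared error.

```python
import math
import random


def simulate(n=300, seed=1):
    """Toy data: y = 1 + 0.5 x + Gaussian noise with sd 0.3 (assumption)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [1.0 + 0.5 * x + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


def statistical_view(xs, ys):
    """SDA: Data = Model + Error. Returns parameters, residuals, residual sd."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sigma = math.sqrt(sum(e * e for e in residuals) / (len(xs) - 2))
    return (a, b), residuals, sigma


def knn_predict(train, x, k=5):
    """Average the responses of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def algorithmic_view(xs, ys, k=5):
    """CDA: Data = Algorithm + Accuracy. Returns held-out RMSE of k-NN."""
    cut = int(0.8 * len(xs))
    train = list(zip(xs[:cut], ys[:cut]))
    test = list(zip(xs[cut:], ys[cut:]))
    squared = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    xs, ys = simulate()
    (a, b), _, sigma = statistical_view(xs, ys)
    print(f"model: y = {a:.2f} + {b:.2f} x, residual sd = {sigma:.2f}")
    print(f"algorithm: 5-NN held-out RMSE = {algorithmic_view(xs, ys):.2f}")
```

The first output is interpretable (parameters and an error term), the second is a black box assessed by its predictions; domain knowledge is what decides which of the two answers the actual question.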
  • 19. Computational science (also scientific computing) is a rapidly growing multidisciplinary field that uses advanced computing capabilities to understand and solve complex problems. It is an area of science which spans many disciplines, but at its core it involves the development of models and simulations to understand natural systems. Computational science is now commonly considered a third mode of science, complementing and adding to experimentation/observation and theory. Substantial effort in computational sciences has been devoted to the development of algorithms (numerical and non-numerical), computer simulations, their efficient implementation in programming languages, and validation of the results to solve science, engineering, and humanities problems. A computational scientist should be capable of:
    - recognizing complex problems;
    - adequately conceptualising the system containing these problems;
    - designing algorithms suitable for studying this system;
    - choosing a suitable computing infrastructure (parallel computing / grid computing / supercomputers);
    - maximising the computational power of the simulation;
    - assessing to what level the output of the simulation resembles the system, i.e. whether the model is validated;
    - adjusting the conceptualisation of the system accordingly;
    - repeating the cycle until a suitable level of validation is obtained.
    The computational scientist trusts that the simulation generates adequately realistic results for the system, under the studied conditions. Computational science is not to be confused with computer and information science, which develops and optimizes the advanced system hardware, software, networking, and data management components needed to solve computationally demanding problems.
  • 20. Data Scientist vs Statistician on Google citations (chart series: Data Scientist, Statistician)
  • 21. Data Scientist definitions database
    ID | AUTHOR | DEFINITION
    1 | DJ Patil | A data scientist is that unique blend of skills that can both unlock the insights of data and tell a fantastic story via the data
    2 | Mike Loukides | Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others
    3 | Jake Porway | A data scientist is a rare hybrid, a computer scientist with the programming abilities to build software to scrape, combine, and manage data from a variety of sources and a statistician who knows how to derive insights from the information within. She combines the skills to create new prototypes with the creativity and thoroughness to ask and answer the deepest questions about the data and what secrets it holds
    4 | Steve Hillion | analytically-minded, statistically and mathematically sophisticated data engineers who can infer insights into business and other complex systems out of large quantities of data
    5 | Hilary Mason | A data scientist is someone who blends math, algorithms, and an understanding of human behavior with the ability to hack systems together to get answers to interesting human questions from data
    6 | Anjul Bhambhri | A data scientist is part digital trendspotter and part storyteller stitching various pieces of information together
    7 | Malcolm Chisholm | A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to the organization
    8 | Pat Hanrahan | The definition of "data scientist" could be broadened to cover almost everyone who works with data in an organization. At the most basic level, you are a data scientist if you have the analytical skills and the tools to 'get' data, manipulate it and make decisions with it
    9 | Monica Rogati | By definition all scientists are data scientists. In my opinion, they are half hacker, half analyst, they use data to build products and find insights. It's Columbus meet Columbo – starry eyed explorers and skeptical detectives
    A data scientist is someone who can obtain, scrub, explore, model and interpret
  • 22. Data Scientists through an SNA (social network analysis)
  • 23. A lexical correspondence analysis of 80 Data Scientist definitions. 1st axis: opposition between Researcher and Professional Data Scientists. A typology of the lemmas in 4 groups makes it possible to identify different profiles of data scientist. (Plot labels: Professional Data Scientists, Researcher Data Scientists)
  • 24. Data Scientists, Clusters 1 & 2: central definitions (SEMANTIC AREA: Researcher, Professional)
    CLUSTER | DEFINITION | V.TEST
    “ANALYZING DATA FOR KNOWLEDGE” | A data scientist basically needs to understand the data, extract information and create meaningful data products out of it. There are various technicalities involved in a data and despite software and hardware constraints, a scientist with all his expertise and knowledge has to crack the most complex data problems. Billions of people around the globe interact and utilize social media platforms. But have you ever wondered how so many accounts and the data are stored and kept secured? Ever wondered how many accounts have been left underutilized or unused? This is where the data scientist comes in and uses his skills of getting an insight to the data, understand theories and begin applying them. In this scenario, understanding the domain expertise becomes very crucial (Patrao N.) | 3.49
    “SKILLS FOR WORKING WITH (BIG) DATA” | Data Scientist is a job title for an employee who excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge. A data scientist possesses a combination of analytic, machine learning, data mining and statistical skills as well as experience with algorithms and coding (Ramakrishna N.) | 5.92
  • 25. Data Scientists, Clusters 3 & 4: central definitions (SEMANTIC AREA: Researcher, Professional)
    CLUSTER | DEFINITION | V.TEST
    “DEALING WITH NEW METHODOLOGICAL ISSUES” | Perform and interpret data studies and product experiments concerning new data sources or new uses for existing data sources. Develop prototypes, proof of concepts, algorithms, predictive models, and custom analysis. Design and build new data set processes for modeling, data mining, and production purposes. Determine new ways to improve data and search quality, and predictive capabilities (Castillo M.) | 5.14
    “IT’S A NEW JOB” | A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge. Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization (Ventura E.) | 7.32
  • 26. What do Data Scientists do?
  • 27. From the point of view of the labour market, more and more Data Scientist job titles are appearing. Some of the most prominent are:
    • Statistician
    • Data Scientist
    • Data Analyst
    • Business Analyst
    • Business Intelligence Manager
    • Data/Analytics Manager
    • Data Engineer
    • Data Architect
    • Data Administrator
  • 29. Data job descriptions
    Data Analyst. A Data Analyst works to interpret data to get actionable insights for the company. With a strong background in statistics and the ability to convert data from a raw form to a different format (data munging), the Data Analyst collects and processes structured data and applies statistical algorithms to it.
    • Responsibilities: Data collection and processing, programming, machine learning, data munging, data visualization, applying statistical analysis
    • Languages: R, Python, SQL, NoSQL, HTML, JavaScript, C/C++, SPSS
    Data Scientist. A Data Scientist’s mission is similar to that of a Data Analyst: find actionable insights that are key to a company’s growth and decision-making. However, a Data Scientist is needed when big data require more robust skills for sorting through a lot of unstructured data to identify questions and pull out critical information. The person then cleanses the data for proper analysis and creates new algorithms to run queries that relate data from disparate sources. On top of these skills, a Data Scientist also needs strong storytelling and visualization skills to share insights with peers across the company.
    • Responsibilities: Identifying questions, running queries, data cleansing and processing, predictive modeling, machine learning, applying statistical analysis, correlating disparate data, storytelling and visualization
    • Languages: R, Python, SAS, Hive, MATLAB, SQL, Pig, Spark, Hadoop
  • 30. Data job descriptions
    Data Architect. A Data Architect is the go-to person for data management, especially when dealing with any number of disparate data sources. With an extensive knowledge of how databases work, as well as how the acquired data relates to the business’s operations, the Data Architect, ideally, is able to anticipate how changes will affect the company’s data use, then adapt the data architecture to compensate for them.
    • Responsibilities: Data warehousing, ETL, architecture development, modeling
    • Languages: Hive, SQL, Pig, Spark, XML
    Data Engineer. This role is closely related to the Data Architect. The Data Engineer also works on the management side of data, making some people think the titles are interchangeable. However, a Data Engineer, who usually has a strong background in software engineering, builds, tests and maintains the data architecture.
    • Responsibilities: ETL, installing data warehousing solutions, data modeling, data architecture and development, database architecture testing
    • Languages: R, Python, SAS, MATLAB, SQL, NoSQL, Pig, Hadoop, Java, C/C++
  • 31. 40 techniques used by Data Scientists: Principal Component; Neural Networks; Support Vector Machine; Nearest Neighbors; Feature Selection; (Geo-) Spatial Modeling; Recommendation Engine; Search Engine; Attribution Modeling; Collaborative Filtering; Rule System; Linkage Analysis; Linear Regression; Logistic Regression; Jackknife Regression; Density Estimation; Confidence Interval; Test of Hypotheses; Pattern Recognition; Clustering; Supervised Learning (classification); Time Series; Decision Trees; Random Numbers; Monte-Carlo Simulation; Bayesian Statistics; Naive Bayes; Association Rules; Scoring Engine; Segmentation; Predictive Modeling; Graphs; Deep Learning; Game Theory; Imputation; Survival Analysis; Arbitrage; Lift Modeling; Yield Optimization; Cross-Validation; Model Fitting.
    “Data science without statistics is possible, even desirable” (Vincent Granville, @DSC 2014)! “Statistics is Dead – Long Live Data Science…” (Lee Baker, @DSC 2016)!
  • 32. Techniques used by Data Scientists (Source: KDnuggets 2017)
  • 33. Data analysed by Data Scientists (Source: KDnuggets)
  • 34. Software used by Data Scientists (Source: KDnuggets)
  • 35. Largest data analysed by Data Scientists (Source: KDnuggets)
  • 36. How to become a data scientist? Meeting the need:
    School: core mathematics.
    BSc: continue to focus on single disciplines, especially mathematics (including probability) and computing.
    MSc: increase focus on statistics, begin to develop interdisciplinarity, but beware of “cut-and-paste data science” curricula.
    PhD: encourage interdisciplinary and team-based projects.
    PostDoc: focus on training fellowships, to include migrants from other disciplines.
    (Peter J Diggle, 2015)
  • 37. Suggestions for an MSc in Business Data Science
  • 38. Data Science challenges for Statistics. According to a recent poll by KDnuggets, a large majority (68%) of the audience thought that in the era of Big Data, Statistics will become more important, as the foundation of Data Science. The rise of Data Science could be seen as a potential threat to the long-term status of the statistics discipline, but there is also a much greater opportunity to re-emphasize the universal relevance of statistical thinking to the interpretation and exploitation of data, by improving the links between statistics and information technology, and also with the communities that work with new and big data. We hope that statisticians will be able to take this opportunity by developing new methods in a knowledge domain perspective, i.e.
    • Knowledge-based Computational Statistics
    • Statistical & Algorithmic intelligent data analysis
    contributing as well to the Data Science needs of the different scientific and professional domains that involve new and big data. Cooperation between statisticians and computer scientists in the data revolution era will make it possible to deal properly with data management and preparation problems (data extraction, data and source integration, data cleaning and validation, knowledge coding). This task accounts for more than 70% of the whole data processing effort. It has a strong impact on data quality and consequently on data science results and actionable knowledge.
  • 39. About Big Data… “Big data” is everywhere. The term was added to the Oxford English Dictionary in 2013. Now, Gartner’s just-released 2017 Hype Cycle shows “big data” passing the “peak of inflated expectations” and moving on its way down into the “trough of disillusionment.” Big data is all the rage. But what does it actually mean? We analysed more than 45 definitions collected on a blog at Berkeley. A commonly repeated definition cites the three Vs: volume, velocity, and variety. But others argue that it’s not the size of data that counts, but the tools being used or the insights that can be drawn from a dataset.
  • 40. A lexical correspondence analysis of 45 Big Data definitions. 1st axis: opposition between Academic Authors and Professional Data Scientists. A typology of the lemmas in 4 groups makes it possible to identify different profiles of definition authors (Academics, Influencers, DS Managers, DS Professionals).
  • 41. Big Data Definitions – 4-class Typology – The central definitions
    • The first group (which contains 50% of the lemmas) gathers definitions that aim to identify the characterizing traits of the concept of big data, hence "complex", "dataset", "large", and related concepts such as "analysis" and "technique". These are, in a sense, mainstream definitions, in which the keywords usually used to describe the phenomenon abound. A definition representative of this group is: "As computational efficiency continues to increase, “big data” will be less about the actual size of a particular dataset and more about the specific expertise needed to process it. “Big data” will ultimately describe any datasets large enough to need high-level programming skills and statistically defensible methodologies in order to transform the data asset into something of value."
    • The second group gathers definitions that try to contextualize the phenomenon of big data; here we find many concepts related to the temporal dimension, such as "time", "now" and "new". Among the most representative of this group we find: "Big data, which started as a technological innovation in distributed computing, is now a cultural movement by which we continue to discover how humanity interacts with the world - and each other - at large-scale".
  • 42. Big Data Definitions – 4-class Typology – The central definitions
    • In the third group, by contrast, we find definitions that also offer extra-economic perspectives on big data, reflecting on how big data could be useful to humanity as a whole and not only in an economic sense. In this group we find concepts such as "world", "people" and "possibility". Among the most representative definitions of this group we find: "Big data is an umbrella term that means a lot of different things, but to me, it means the possibility of doing extraordinary things using modern machine learning techniques on digital data. Whether it is predicting illness, the weather, the spread of infectious diseases, or what you will buy next, it offers a world of possibilities for improving people’s lives."
    • Finally, the last group contains very technical definitions, also aimed at promoting big data in an economic sense, such as the following: "[Big data means] harnessing more sources of diverse data where “data variety” and “data velocity” are the key opportunities. (Each source represents “a signal” on what is happening in the business.) The opportunity is to harness data variety [and] automate “harmonization” of data sources to deliver fast-updating insights consumable by the line-of-business users."
  • 43. Big Data Definitions – 4-class Typology – The central definitions
    1st class (31). AnnaLee Saxenian, Dean, UC Berkeley School of Information (Academic): I’m not fond of the phrase “big data” because it focuses on the volume of data, obscuring the far-reaching changes that are making data essential to individuals and organizations in today’s world. But if I have to define it I’d say that “big data” is data that can’t be processed using standard databases because it is too big, too fast-moving, or too complex for traditional data processing tools.
    2nd class (28). Gregory Piatetsky-Shapiro, President and Editor, KDnuggets.com (Influencer): The best definition I saw is, “Data is big when data size becomes part of the problem.” However, this refers to the size only. Now the buzzword “big data” refers to the new data-driven paradigm of business, science and technology, where the huge data size and scope enables better and new services, products, and platforms. #BigData also generates a lot of hype and will probably be replaced by a new buzzword, like “Internet of Things,” but “big data”-enabled services companies, like Google, Facebook,
    3rd class (5). Mike Cavaretta, Data Scientist Consultant (DS Consultant): You cannot give me too much data. I see big data as storytelling, whether it is through information graphics or other visual aids that explain it in a way that allows others to understand across sectors. I always push for the full scope of the data over averages and aggregations, and I like to go to the raw data because of the possibilities of things you can do with it.
    4th class (22). Sharmila Mulligan, CEO and Founder, ClearStory Data (DS Manager): [Big data means] harnessing more sources of diverse data where “data variety” and “data velocity” are the key opportunities. (Each source represents “a signal” on what is happening in the business.) The opportunity is to harness data variety [and] automate “harmonization” of data sources to deliver fast-updating insights consumable by the line-of-business users.
  • 44. ‘’Let’s talk about Data Science’’ Big Data challenges for Statistics By their nature, Big Data problems usually require multidisciplinary teams: typically domain experts, computational experts, machine learning experts, data miners and statisticians. • In particular, statisticians help translate the scientific question into a statistical question, which includes carefully describing the data structure; the underlying system that generated the data (the model); and what we are trying to assess (the parameters we wish to estimate) or predict. What does Statistics bring to Big Data and where are the opportunities? • Statistics is fundamental to ensuring that meaningful, accurate information is extracted from Big Data, especially with regard to: o data quality; o missing and incomplete data; o quantification of the uncertainty of predictions, forecasts and models. Statisticians are skilled at validation and bias correction; measuring uncertainty; designing studies and sampling strategies; assessing and certifying data quality; enumerating the limitations of studies; dealing with issues such as missing data and other sources of non-sampling error; developing models for the analysis of complex data structures; creating methods for causal inference and comparative effectiveness; eliminating redundant and uninformative variables; and integrating data from multiple sources.
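As an illustration of what "quantification of the uncertainty" can mean in practice, here is a minimal sketch of a percentile bootstrap confidence interval for a sample mean. The bootstrap is an assumption chosen for illustration; the slide does not name a specific technique. The function name `bootstrap_ci`, its parameters and the toy sample are hypothetical.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat` (illustrative sketch only).

    Resamples `sample` with replacement, recomputes the statistic each time,
    and returns the empirical alpha/2 and 1 - alpha/2 quantiles.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical toy sample (not taken from the slides)
sample = [12.1, 9.8, 11.4, 10.2, 13.7, 8.9, 10.5, 11.9, 9.4, 12.6]
low, high = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

The interval width, rather than the point estimate alone, is what tells a decision maker how much the estimate could move under resampling.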
  • 45. Big Data Challenges for Data Scientists/Statisticians Data Engineer: Big Data? Data Scientist: No thanks! In order to conduct my business I need Informative Data. The most important thing about data is not its size but its informative content. [Diagram labels: Big Data, ID processing, Informative Data (ID), Information, Knowledge, Decision] Some knowledge representation tools: intervals; histograms; logical rules; hierarchical rules; probability models; graphs; networks; metadata; ontologies... Theory without data is blind. Data without knowledge is lame. A useful approach to ID processing: SYMBOLIC DATA ANALYSIS
  • 46. Informative Data as Symbolic Data Table (SDT) Symbolic Data Analysis tools such as descriptive statistics, PCA, regression, decision trees and clustering have been developed in order to analyze and discover new knowledge from data. «The quality of an SDT can be measured in terms of the explanatory and discriminatory power of its symbolic features» An SDT offers a representation of the variability we find in BIG DATA. F. Brambilla: «Statistics is the Science that studies the variability of phenomena» [Figure: SDT, from E. Diday, Thinking by classes in data science: The symbolic data analysis paradigm. WIREs, Vol. 8, Sept./Oct. 2016]
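To make the idea of a Symbolic Data Table more concrete, the following minimal sketch aggregates individual records into one symbolic description per class: numeric variables become intervals and categorical variables become relative-frequency distributions. It is an illustration under stated assumptions, not Diday's implementation; the function `build_symbolic_table`, the variable names and the toy records are all hypothetical.

```python
from collections import defaultdict


def build_symbolic_table(records, class_key, numeric_keys, categorical_keys):
    """Aggregate individual records into one symbolic description per class.

    Numeric variables are summarised as (min, max) intervals; categorical
    variables as relative-frequency distributions. Illustrative sketch only.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[class_key]].append(rec)

    table = {}
    for cls, rows in groups.items():
        desc = {"n": len(rows)}  # class size, kept alongside the symbolic features
        for key in numeric_keys:
            values = [r[key] for r in rows]
            desc[key] = (min(values), max(values))  # interval-valued feature
        for key in categorical_keys:
            counts = defaultdict(int)
            for r in rows:
                counts[r[key]] += 1
            # distribution-valued feature: category -> relative frequency
            desc[key] = {cat: c / len(rows) for cat, c in counts.items()}
        table[cls] = desc
    return table


# Hypothetical toy records (not taken from the slides)
records = [
    {"city": "Naples", "age": 25, "channel": "web"},
    {"city": "Naples", "age": 40, "channel": "shop"},
    {"city": "Naples", "age": 31, "channel": "web"},
    {"city": "Zug", "age": 52, "channel": "shop"},
    {"city": "Zug", "age": 47, "channel": "shop"},
]
sdt = build_symbolic_table(records, "city", ["age"], ["channel"])
for cls, desc in sdt.items():
    print(cls, desc)
```

Each row of the resulting table describes a class rather than an individual, so the within-class variability is preserved instead of being collapsed into a single average.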
  • 47. Toward a Knowledge Based Data Science Knowledge Discovery is a sequential learning process. Supervised statistical methods allow investigators to produce new knowledge! [Figure labels: Knowledge encoding & data integration; Knowledge Pyramid]
  • 48. Conclusion: toward a Knowledge Based Data Science Data Science is an interdisciplinary approach to meeting the new challenges of the Information Society. It is based mainly on the methods of Statistics and Computational Science, suitably supplemented by knowledge of the different domains. Computational Science represents the language of Data Science, whereas Statistics is the logic of Data Science itself. Knowledge of the various domains of interest constitutes the prerequisite of a Data Science. Thus, from this point of view, it would be preferable to speak about DATA SCIENCES. The main novelty in the Data Sciences lies in the role of KNOWLEDGE. Encoded in a proper way (intervals, histograms, functions, logical rules or hierarchies, graphs, metadata, ontologies, etc.), it can be used in the different steps of a Data Science exercise: - to automate the step of (Big) Data cleaning and refinement (feature selection); - to obtain new (Big) Data representations in terms of Informative Data; - to drive data processing methods in the right/expected direction, avoiding trivial results; - to allow coherent interpretation of results and enrich storytelling; - to make suitable decisions. For these reasons I like to call such an approach a Knowledge Based Data Science.
  • 49. Prolegomena to Any Future Statistics, that will be able to present itself as a (Data) Science Carlo Lauro Emeritus Professor of Statistics University of Naples Federico II THANK YOU FOR YOUR ATTENTION!! carlo.lauro@unina.it carlo.lauro@selfschool.ch Scientific Meeting in Memory of Simona Balbi Naples, February 19, 2019 Director of the Department of Economic & Managerial Sciences in Digital Era SELF HOCHSCHULE, ZUG - CH