Data is everywhere and in everything we do. Usable information is often hidden in raw data, so there is increasing demand for people capable of working creatively with it. To understand how we can help data science workers become more productive in their jobs, we first need to understand who they are, how they work, which skills they hold and lack, and which tools they need. In this paper, we present the results of the analysis of several interviews conducted with data scientists. Our research allowed us to conclude that the heterogeneity among these professionals is still understudied, which makes the development of methodologies and tools more challenging and error-prone. The results of this research are particularly useful to both the scientific community and industry when proposing adequate solutions for these professionals.
1. IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC)
Dunedin, New Zealand
11 - 14 August 2020
On Understanding Data Scientists
Paula Pereira
University of Minho
Portugal
a77672@alunos.uminho.pt
Jácome Cunha
University of Minho & HASLab/INESC Tec
Portugal
jacome@di.uminho.pt
João Paulo Fernandes
CISUC, University of Coimbra
Portugal
jpf@dei.uc.pt
2. Forbes
“… each flight generating more than 30 times the amount of data the previous generation of wide-bodied jets produced.”
“By 2026, annual data generation should reach 98 billion gigabytes, or 98 million terabytes, according to a 2016 estimate by Oliver Wyman.”
https://www.forbes.com/sites/oliverwyman/2017/06/16/the-data-science-revolution-transforming-aviation/#67f14be67f6c
3. Data Scientist: The Sexiest Job of the 21st Century
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
4. We need to know more about data science and data scientists!
5. Participants
• We conducted semi-structured interviews with 8 people (1 excluded: P3)
• 3 female, 5 male
• 1 Business Intelligence Manager, 1 Big Data Architect, 2 Data Analysts, 4 Data Scientists
• Domains ranging from music management to web development
INTERVIEWEES INFORMATION.
ID | Sex | Age | Job Title | Education Level | Domain
P1 | F | 30 | Data Analyst | Master, Marketing | Music Manag.
P2 | M | 36 | Business Intelligence Manager | Master, Data Analysis and Decision Support Systems | Retail
P3 | M | 37 | Big Data Architect | Bachelor, Math and Computer Science | Software Dev.
P4 | M | 34 | Data Scientist | PhD, Electrical and Computer Engineering | Telecom.
P5 | M | 42 | Data Scientist | PhD, Data Mining | Virtual Call Center
P6 | F | 26 | Data Analyst | Master, Mathematics and Computation | Web Dev.
P7 | F | 32 | Data Scientist | Master, Mathematics Engineering | Virtual Call Center
P8 | M | 32 | Data Scientist | PhD, Evolutionary Biology | Telecom.
7. Academic Background
• MSc in Marketing (Bachelor in Hotel Management)
• Bachelor in Economics
• PhD in Electrical and Computer Engineering
• MSc in Mathematics Engineering
• MSc in Mathematics and Computation
• PhD in Evolutionary Biology
• PhD in Data Mining
8. Implications
• Some feel the need to learn more
“I had been working in auditing information systems for two years, and at that time I decided that data was ‘the thing’ and I went to get a master’s degree in Data Analysis and Decision Support Systems.” – P2
• Background also determines the kinds of tasks performed
• Only those with training in CS or engineering perform tasks related to the creation of machine learning and deep learning models
• The remaining participants focus on more direct analysis, based on statistical parameters such as averages, standard deviations, and distributions
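As a rough illustration of the "more direct analysis" mentioned above, the following minimal sketch computes an average, a standard deviation, and a simple frequency distribution using only the Python standard library. The sample values and the variable names (`orders`, `distribution`) are hypothetical and do not come from the interviews.

```python
import statistics
from collections import Counter

# Hypothetical sample: monthly order counts (illustrative values, not study data)
orders = [12, 15, 11, 19, 15, 22, 15]

mean = statistics.mean(orders)      # arithmetic average
stdev = statistics.stdev(orders)    # sample standard deviation
median = statistics.median(orders)  # middle value of the sorted sample

# A simple distribution: how often each value occurs
distribution = Counter(orders)

print(f"mean={mean:.2f} stdev={stdev:.2f} median={median}")
print("most common value:", distribution.most_common(1)[0])
```

The same summary is available in a spreadsheet or in R; the point is only that this kind of analysis needs no model training.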
9. Data Sources and Quality
• Data sources
• Data generated internally by various teams
• Public data sources are also frequently used
• The need for data integration is significant
• Data ranges from customer data to operational data
• Only one participant (P2) reported using some kind of data quality metrics
11. Tools and Programming Languages
• R and Python (not new)
• The choice is made according to personal preference and the type of task
• In some cases (P5, P7), the choice is made as a team so that all members use the same technologies
• Most participants do not use data analysis tools
• These tools end up limiting their analysis
• This does not happen when they write their own code
• However, P1 and P2 do a large part of their analysis using only MS Excel (very fast results)
12. Difficulties
• Lack of training in the field of data science
• Access to quality information that is relevant to the problems at hand
“I believe that access to quality information and information relevant to our problems is the greatest
challenge.” – P4
• Lack of teammates
“In my case, being alone is a big limitation, ...and initially, it is very difficult to have the required business
expertise to understand what are its needs.” – P6
13. Yes, There Are More Difficulties
• It is often very difficult to translate business problems into data science problems
• Development of stable and scalable code
“On a personal level, I think my biggest challenge is to write a stable and scalable code because my
training is not very oriented for software engineering.” – P8
• Professionals are being hired for data science positions that should be filled by other types of professionals
“Companies look at the market and, because there is a demand for data scientists, they also want to hire
one. However, looking at the job’s requirements, their needs would be easily mitigated by other types of
professionals.” – P7
14. Almost…
• The definition of a data science job is very unclear and eclectic
• It is important to clarify what the different areas of data science are
• This would help the professionals who wish to work in this field to position
themselves correctly
“In my opinion, there are two main areas: technological and application data science. The data scientist of the future
must know how to put himself in the right area of data science to avoid regretting what (s)he is doing.” – P2
• All participants agree that it is a great advantage to have people with different backgrounds in data science teams because they bring different perspectives on the data
16. Opportunities for the Research Community
• We still need to learn more about data scientists
• Only then can we help them
• They are also (data science) end users
• Just as we have helped (and still help) software engineers and end-user developers, we need to help these new end users
• Tailored languages, tools, methodologies, …
• For learning, data cleaning, analysis, visualization, integration, etc.