(Big) Data (Science) Skills
1. (Big) Data (Science) Skills
Big Data Value Association Summit in Madrid
17/06/2015
Oscar Corcho
ocorcho@fi.upm.es
@ocorcho
https://www.slideshare.com/ocorcho
2. License
• This work is licensed under the
CC BY-NC-SA 4.0 International license
• http://purl.org/NET/rdflicense/cc-by-nc-sa4.0
• You are free:
• to Share — to copy, distribute and transmit the work
• to Remix — to adapt the work
• Under the following conditions
• Attribution — You must attribute the work by inserting
• “[source Oscar Corcho]” at the footer of each reused slide
• a credits slide stating: “These slides are partially based on
“(Big) Data (Science) Skills” by O. Corcho”
• Non-commercial
• Share-Alike
3. Data Scientist: Technical and Soft Skills needed
• One of the two or
three pictures
expected from a talk
on skills…
• I could start by going
through
• Each of these topics
• Discussing the
specific skills needed
• However…
Sorry, looking for the reference to add here
4. What is Big Data?
Source: http://www.philipchircop.com/post/25783275888/seeing-the-full-elephant-its-a-tree-its-a
6. Characteristics of an ecological niche
• A niche is defined by a spectrum of resource usage
• Species differ from each other in how efficient they are in
using resources that change continuously
• Characteristics of a niche
• Amplitude (range in which resources are used)
• Generalist species (they can use a wide range of
resources)
• Specialist species (they require a very specific
combination of resources)
• Overlap (similarity among niches in their usage of resources)
• Competitive exclusion principle (Gause, 1934)
• If two species coexist in a stable environment, they do so as a
result of the differentiation of their effective ecological niches.
Source: Javier Seoane. Ecología. Unidad Temática 21. Teoría del nicho ecológico
8. Big Data Niche 1. HPC and e-Infrastructure Experts
Background: Computer Science (Systems)
System Administration
Terms used in their native language:
Blades, Infiniband, OpenMPI,
racks, HDF, TBs, Gflops
Their daily life:
Check system logs
Make sure that queues are active
Install a new rack
What’s Big Data for them?
A “commercial” term for something
that they have done for a long time
They really know how to configure
and monitor a Hadoop cluster
They would love to see those who talk
about Big Data run processes
on fluid dynamics
9. Big Data Niche 2. Data Storage and Access Experts
Background: Computer Science
Database administration
Terms used in their native language:
SQL, NoSQL, Column store
Transactions, Hive, TBs/PBs/…,
TPS (Transactions per s)
Their daily life:
Optimise several queries
Run a new benchmark
Design an optimiser/physical operator
What’s Big Data for them?
A new opportunity to work on
optimisation algorithms
They know how to configure a database
They often laugh at those who deploy
a NoSQL solution for a problem
that can be solved with a
relational database
10. Big Data Niche 3. Machine Learning Experts
Background: Mathematics, Statistics,
Physics, Computer Science
Terms used in their native language:
Complexity, algorithm, p-value,
convergence, precision, recall
ROC curves, Bayesian networks, R
Their daily life:
Read about a new problem
Write down a few formulae on the
whiteboard (or even a blackboard)
Prove that the algorithm terminates
What’s Big Data for them?
The same problems applied to data of
larger size, with new challenges
Problems are not only solved in
Hadoop or a powerful NoSQL DB
Astonished by those who still mix up
correlation and causality
11. Big Data Niche 4. Slow-data Experts
Background: Computer Science, Statistics,
Library Sciences, Linguistics
Terms used in their native language:
Information model, vocabulary,
ontology, data quality, curation
Their daily life:
Receive a database schema
Talk to data producers and (re)users
Obtain consensus and transform data
What’s Big Data for them?
The difficulty lies in the variety of
data formats and structures
We may integrate data from varied
sources, although this is not
always possible
When you manage to integrate
heterogeneous data, you can achieve
better results
12. Big Data Niche 5. (Big Data) Consultants
Background: Computer Science, Economics,
…
Terms used in their native language:
Business model, business opportunity,
Big Data, Data Value Chain,
Hadoop, Spark, R, TBs, GFlops
Their daily life:
Read a Gartner Big Data report
Talk to potential customers
Transfer needs to technicians
What’s Big Data for them?
It’s the 4Vs, plus a few more
I have a PPT presentation with a
Big Data infrastructure,
architecture,
and previous projects, which I will
use to sell a project to my
customers
13. Are we missing any ecological niche?
• We have already seen a couple of ecological
niches…
• They all coexist
• Some of them are overlapping
Is there anyone who has not yet been
considered?
14. The evolution of a new species: the Data Scientist
Background: Computer Science+Statistics+
+Mathematics+Economics+
…
Terms used in their new exotic language:
HPC, databases, algorithms,
harmonisation, integration,
Hadoop, Spark, R, TBs, GFlops
Their daily life:
Learn about a new infrastructure
Code scripts to be run on Spark
Interpret results
Install a new framework
Read a few scientific papers
Make shiny presentations
Describe in their blog the activities
that they do, so that Big Data is
better known and understood
…
16. Data Scientists and Pi-shaped people
• Let’s now go into
the expected
discussion
Sorry, looking for the reference to add here
17. Will all species survive?
• If Big Data defines an ecosystem…
• Which species will survive?
• Will Data Scientists wipe out the other species?
• Or will they be able to live in perfect symbiosis?
What is the ideal training required
for the individuals of these
species so that they can survive?
21. Masters in Data Science, Big Data and the like (III)
Year 1
• Data handling
• Data analysis
• Advanced data analysis and data
management
• Visualization
• Applications
Year 2
22. Are we doing it right in terms of training?
• Probably it is all about lack of maturity in the area, but
syllabi do not seem to be perfectly compatible…
• It is not easy to believe that we can create Data
Scientists in only one year
• Should we train people to know a bit about everything?
• Or should we separate more clearly the species in our
ecosystem and specialise them better for their work?
How do we manage to keep a
healthy and stable ecosystem?
23. Shameless self-promotion
• Strategies for success in the
Digital-Data Revolution
• Separation of concerns
• Intellectual ramps
• Data-intensive knowledge
discovery
• Components and usage
patterns
• Data-intensive engineering
• Development vs enactment
• Data-intensive application
experiences
• In Science
• In Business
Can we learn from lessons
learned in Data-Intensive
Science?
24. Separation of concerns: three clear profiles
• Domain experts (WHAT)
• They know the problems they want to
solve
• They know the application domain
• They can create (scientific) workflows
• Data-intensive analysts (WHAT)
• They know a lot about (Big) data
analysis
• They may not know about the
infrastructure behind the scenes
• They do not necessarily know all the
details of the application domain
• Data-intensive engineers (HOW)
• They know a lot about distributed
computing/infrastructure/HPC/clouds/etc.
• They receive the description of an
algorithm and they can make it more
efficient (parallelisation)
25. Separation of concerns: Differentiated tasks
[Figure: architecture diagram with a tool level, a gateway, an enactment level and a component library, alongside an example workflow (CorrFarm)]
• User and application diversity
• Iterative "what" process development
• Accommodating and facilitating several application domains, tool sets, process representations and working practices
• Gateway: DISPEL representation
• System complexity
• Mapping, optimisation, deployment and execution
• Composing and providing many autonomous resources, one enactment mechanism, a single platform
• Example workflow (CorrFarm)
• A Programmable Filter Project component takes the input "inp" and a list of rules: [<select = "1<= day(inp.first.start)<=5", project="inp">, <select = "6<= day(inp.first.start)<=10", project="inp">, <select = "11<= day(inp.first.start)<=15", project="inp">, ... ]
• Each output is passed through Sort (rule "second.fURI ASC..."), Tuple Burst (["first,second"]) and De List components
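The filter rules and sort rule shown on this slide can be read as "split the input by day of month, then sort each part". The sketch below is a rough Python analogue of that idea, not DISPEL and not the actual CorrFarm workflow. The field names first.start and second.fURI come from the slide; the representation of the input as (first, second) pairs of dictionaries, the function names and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Day-of-month ranges mirroring the three rules visible on the slide.
DAY_RANGES = [(1, 5), (6, 10), (11, 15)]


def partition_by_day(pairs, ranges=DAY_RANGES):
    """Route each (first, second) pair to the range containing first['start'].day."""
    buckets = defaultdict(list)
    for first, second in pairs:
        day = first["start"].day
        for low, high in ranges:
            if low <= day <= high:
                buckets[(low, high)].append((first, second))
                break  # ranges do not overlap, so stop at the first match
    return buckets


def sort_buckets(buckets):
    """Sort every bucket by second['fURI'] in ascending order."""
    return {
        key: sorted(items, key=lambda pair: pair[1]["fURI"])
        for key, items in buckets.items()
    }


# Hypothetical sample input: pairs of records with the two fields used above.
pairs = [
    ({"start": date(2015, 6, 3)}, {"fURI": "b"}),
    ({"start": date(2015, 6, 2)}, {"fURI": "a"}),
    ({"start": date(2015, 6, 12)}, {"fURI": "c"}),
    ({"start": date(2015, 6, 20)}, {"fURI": "d"}),  # matches no rule, so it is dropped
]

result = sort_buckets(partition_by_day(pairs))
```

In terms of the three profiles, a domain expert or analyst would only state the rules (the "what"), while a data-intensive engineer would decide how the partitions are distributed and sorted in parallel (the "how").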
26. Conclusions
• We all know that there are big opportunities in Big Data
• But we need to be more productive. For that we need:
• Create real multidisciplinary teams with at least three roles
(application developers, data-intensive analysts and data-intensive
engineers)
• Understand that simply by using Hadoop, Spark or R we are not
necessarily doing Big Data
• Just as coding in Java does not necessarily mean
understanding object-oriented programming
• Understand that we have to interpret results adequately, from a
scientific point of view
• Understand the importance of homogenising datasets, in order to
facilitate their integration (slow-data)
• Continue working on delivering tools that can be used to develop
Big Data applications more productively
• Should we also be funding this?