1. BIG DATA - AS OPPOSED
TO SMALL DATA
Mark Whitehorn
2. What is Big data?
Is it really just a marketing campaign?
http://www.perceptualedge.com/articles/visual_business_intelligence/big_data_big_ruse.pdf
“If you’re like me, the mere mention of Big Data
now turns your stomach…. Why all the fuss? Why,
indeed. Essentially, Big Data is a marketing
campaign, pure and simple.”
Stephen Few
3. Big data
Clearly I am not like Stephen Few.
I don’t believe I have a particular axe to grind; I
simply find this interesting.
This talk is designed to try to explain:
• what Big Data is
• what characteristics we have found useful
• why it may be of interest to you
• a paradox
6. Data
So, in the ’60s and ’70s we rapidly learnt to
separate the data, and its manipulation, from
the application
Which led directly to the development of
database engines and, ultimately, relational
ones (DB2, Oracle, SQL Server)
7. Data
Data has always existed in two very broad
flavours…
• Data that is treated as small, discrete
packages and is a good fit with the
relational way of storing and querying data
• Data that is not as above
8. Data is stored in tables
LicenseNo  Make     Model     Year  Colour
CER 162 C  Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk. VI    1946  Black
YSK 114    Bentley  Mk. VI    1949  Red
9. Data is stored in tables
Each table has a name
Car
LicenseNo  Make     Model     Year  Colour
CER 162 C  Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk. VI    1946  Black
YSK 114    Bentley  Mk. VI    1949  Red
10. Data is stored in tables
Car
LicenseNo  Make     Model     Year  Colour
CER 162 C  Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk. VI    1946  Black
YSK 114    Bentley  Mk. VI    1949  Red
Data is atomic
11. Data is stored in tables
Columns
Car
LicenseNo  Make     Model     Year  Colour
CER 162 C  Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk. VI    1946  Black
YSK 114    Bentley  Mk. VI    1949  Red
12. Data is stored in tables
Columns
Car
LicenseNo  Make     Model     Year  Colour
CER 162 C  Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk. VI    1946  Black
YSK 114    Bentley  Mk. VI    1949  Red
Rows
13. Data is stored in tables
Car
LicenseNo  Make     Model     Year  Colour
CER 162 C  Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk. VI    1946  Black
YSK 114    Bentley  Mk. VI    1949  Red
Each row represents a unique entity in
the ‘real’ world…
16. Data
Note that this kind of manipulation is treating
the data as atomic, which is fine, because the
relational model assumes atomicity of data
Note also, that the rows are unordered
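The manipulation referred to here is not shown in this transcript, so the following is only a sketch of what sub-setting the Car table by rows and columns might look like. It uses Python's built-in sqlite3 module; the choice of query is an assumption made for illustration. Because rows are unordered, the result is compared as a set rather than as a list.

```python
import sqlite3

# In-memory database holding the Car table from the earlier slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car ("
    "LicenseNo TEXT PRIMARY KEY, Make TEXT, Model TEXT, "
    "Year INTEGER, Colour TEXT)"
)
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?, ?)",
    [
        ("CER 162 C", "Triumph", "Spitfire", 1965, "Green"),
        ("EF 8972", "Bentley", "Mk. VI", 1946, "Black"),
        ("YSK 114", "Bentley", "Mk. VI", 1949, "Red"),
    ],
)

# Subset by columns (LicenseNo, Colour) and by rows (Make = 'Bentley').
# Each value is treated as an atomic unit; nothing looks inside it.
rows = conn.execute(
    "SELECT LicenseNo, Colour FROM Car WHERE Make = ?", ("Bentley",)
).fetchall()

# Without ORDER BY the engine may return rows in any order,
# so treat the result as a set.
bentleys = set(rows)
conn.close()
```

Adding ORDER BY to the query is the only way to ask for a particular row order.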
17. Data
• Data has always existed in two very broad
flavours…
• Data that is inherently atomic and is a good
fit with the relational way of storing and
querying data
• Data that is not as above
18. Examples
• Examples of ‘other’ data:
• Images
• Music
• Word docs
• Sensor data
• Web logs
• Twitter
• Machines
• Point of Sale
• Mass spectrometers
19. What’s in a name?
So, what do we call the ‘rest’?
• Un-structured?
• Semi-structured?
• Multi-structured?
• Non-relational?
• Non-tabular?
20. What’s in a name?
• What about:
• Big data?
21. Other definitions?
•VVVvvvv
• Volume
• Variety
• Velocity
• Value
• Very interesting
• Various other words beginning with V…..
22. Big Data – not new?
• So why have we focused, for the last 30 years,
almost exclusively on the first flavour?
• Because it:
• is easy (relatively easy – Jim Gray*)
• represents a significant proportion of the
available data
*Jim Gray and Andreas Reuter - Transaction Processing: Concepts and
Techniques (1993)
Turing Award 1998
23. Big Data has come of age
• Two factors have changed
• Rise of the Machines
• Increase in computational power
• There is a great synergy here
• We are acquiring far more big data and we
have computational power to extract the
information it contains
24. Big Data is hard
• 3 Vs
• It is highly variable
• We often want to look inside the data
• Frequently non-atomic
• Need custom functions for virtually every
operation
• Find the rotating wing aircraft in the image
• Identify the best customer
• What does the blogosphere think of our
company?
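As a rough illustration of "looking inside" non-atomic data with a custom function, the sketch below pulls fields out of web log lines. The log layout (Common Log Format), the sample lines and the function names are all assumptions made for this example; they do not come from the talk.

```python
import re
from collections import Counter

# Assumed layout: Common Log Format, e.g.
# host ident user [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def status_counts(lines):
    """Count HTTP status codes, silently skipping lines that cannot be parsed."""
    parsed = (parse_line(line) for line in lines)
    return Counter(entry["status"] for entry in parsed if entry)

# Hypothetical sample data.
sample_lines = [
    '192.0.2.1 - - [10/Oct/2012:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2012:13:56:01 +0000] "GET /missing HTTP/1.1" 404 512',
    'not a log line',
]
counts = status_counts(sample_lines)
```

A different log format, or a different question, would need a different pattern and a different function, which is the point the slide makes.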
27. Big Data
• Examples
• Log file
• Mass spec.
• Images
29. What is Big Data?
• Examples
• Log file
• Mass spec.
• Images
BIG DATA
30. Summary so far……
• Just as you can always fit an aircraft engine
into a car chassis, you can always put Big
Data in a table, but you probably don’t
want to
• The analysis is not sub-setting the data by
rows and columns
• So each class of big data usually requires a
(lovingly hand-crafted) custom analysis
31. Case Study
Big Data in the Life Sciences World
The massed spectrometers
Why would anyone do that?
33. Human Genome Project
Human Genome Project
$3 billion – 13 Years
Our genes define us.
Errr…. how does
that work exactly?
34. What is a protein?
DNA: blueprint
Protein: product
35. Why study proteins
GENOME
Genes contain instructions for creating proteins
PROTEOME
Proteins carry out functions within a cell
36. Example Proteins
Protein: Actin
Function: Contracts Muscles
Protein: Insulin
Function: Controls Blood Sugar
Protein: Keratin
Function: Forms Hair and Nails
Protein: Hemoglobin
Function: Carries Oxygen
Protein: Antibody
Function: Fights Viruses
37. biSCIENCE
20-25,000 genes in the human
genome.
Every nucleated cell in the same
human has the same genome.
But not all genes are active at
the same time.
Perhaps 15-18,000 active
proteins in any one cell at any
one time.
38. Slowly changing over millions of years
Rapidly changing over a day
39. Studying Proteins
Proteins are chopped up using an
enzyme to make them easier to measure.
A specialised instrument (Mass Spectrometer) is
used to measure (‘weigh’) the small protein
fragments.
We can use the mass of the small fragments to
carry out intelligent database searches to identify
which protein was detected.
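The idea of matching fragment masses against a database can be sketched roughly as follows. This is a toy illustration, not the search actually used in the work described: the cleavage rule (trypsin-like, after K or R unless followed by P), the two sequences in the database, the tolerance and all function names are assumptions. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard reference values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # each free peptide carries one extra water

def digest(sequence):
    """Chop a protein after K or R, unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Sum the residue masses and add one water."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER

def identify(observed_masses, database, tolerance=0.01):
    """Score each protein by how many observed masses match a predicted fragment."""
    scores = {}
    for name, sequence in database.items():
        predicted = [peptide_mass(p) for p in digest(sequence)]
        scores[name] = sum(
            any(abs(mass - p) <= tolerance for p in predicted)
            for mass in observed_masses
        )
    return max(scores, key=scores.get), scores

# Hypothetical two-protein database and simulated measurements.
TOY_DATABASE = {"toy_protein_A": "GASKPVTRCLK", "toy_protein_B": "MHFRYWK"}
observed = [peptide_mass(p) for p in digest(TOY_DATABASE["toy_protein_A"])]
best, scores = identify(observed, TOY_DATABASE)
```

Real searches have to cope with noise, missed cleavages and chemical modifications, none of which this sketch attempts.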
46. Localisation
Protein Peptide Alignment Map
Normalised Profiles for Synthesis, Degradation and Turnover
Comparison Between Compartments
47. Custom analysis and custom visualisation –
vital tools in understanding big data
48. Intensive Data Processing Required to derive
Information from the raw data
• Base Line Correction
• Peak Detection
• Deisotoping
BioConductor PROcess R Package
Proteomics Volume 3, Issue 8, Article first published online: 12 AUG 2003
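To give a feel for two of the processing steps named on this slide, here is a deliberately simple sketch of baseline correction and peak detection. It is not the algorithm used by the PROcess package: the rolling-minimum baseline, the local-maximum peak rule, the threshold and the sample intensities are all assumptions chosen for illustration.

```python
def subtract_baseline(intensities, window=5):
    """Estimate the baseline as a rolling minimum and subtract it."""
    half = window // 2
    corrected = []
    for i, value in enumerate(intensities):
        lo = max(0, i - half)
        hi = min(len(intensities), i + half + 1)
        corrected.append(value - min(intensities[lo:hi]))
    return corrected

def detect_peaks(intensities, threshold):
    """Return indices of local maxima that rise above the threshold."""
    return [
        i for i in range(1, len(intensities) - 1)
        if intensities[i] > threshold
        and intensities[i] > intensities[i - 1]
        and intensities[i] >= intensities[i + 1]
    ]

# Hypothetical spectrum: two peaks sitting on a slowly rising baseline.
spectrum = [10, 11, 10, 30, 12, 11, 12, 50, 13, 12, 12]
corrected = subtract_baseline(spectrum, window=5)
peaks = detect_peaks(corrected, threshold=10)
```

On a real spectrum with many thousands of points per scan, steps like these run for every scan, which is where the processing cost comes from.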
49. “proteomics is much more complicated
than genomics . . . while an organism's
genome is more or less constant, the
proteome differs from cell to cell and
over time”
Computationally, perhaps three orders
of magnitude more complex than HGP
50. biSCIENCE
Why bother trying to quantify it?
Because this is payback time.
Documenting the proteome
opens the door to a whole new
world.
51. biSCIENCE
So, what is a data scientist?
My favourite description comes from Twitter:
“Yeah, so I'm actually a data scientist. I just do this
barista thing in between gigs.”
More cynically:
“A data scientist is just an analyst who lives in
California.”
52. biSCIENCE
Possibly more accurate is that a data scientist (DS) is
“a better software engineer than any statistician and
a better statistician than any software engineer”.
53. biSCIENCE
DSs are also part artist and part engineer. They
need a toolbox of techniques, skills, processes and
abilities from which to construct novel solutions.
And they need the ability to create a UI that turns
their abstract findings into something that the users
of the system can understand, so DSs also need the
skills to create elegant visualisations that turn raw
data into information.
54. biSCIENCE
And (yes, there’s more) they need to be able to
communicate well with people. There is little use in
creating a superb analytical process if you can’t
communicate how and why it works to the board
members.
55. biSCIENCE
And then there is the curiosity. Duncan Ross
(Director of Data Sciences at Teradata) characterised
data scientists well:
The first and most important trait is curiosity. Insane
curiosity. In many walks of life evolution selects
against the kind of person who decides to find out
what happens “if I push that button”. Data Science
selects for it.
56. biSCIENCE
So, what are the general characteristics of a DS?
They include:
• insatiable curiosity (see above)
• interdisciplinary interests
• excellent communication skills
• excellent analytical capabilities
57. biSCIENCE
DSs also need a good working knowledge of:
• machine learning techniques
• data mining
• statistics
• maths
• algorithm development
• code development
• data visualisation
• multi-dimensional database design and
implementation
58. biSCIENCE
Specific skills include the technologies to handle big
data:
• NoSQL databases
• Hadoop and related technologies
• MapReduce and its implementation on differing
software platforms
59. biSCIENCE
DSs also have an intimate knowledge of languages
such as:
• SQL
• MDX
• R
• Functional and OOP languages such as Erlang and
Java
60. biSCIENCE
Most of all, no matter what they are called, all true
data scientists have started playing with some data
at 8:00PM and suddenly found it is 3:00AM.
61. Case Study
Twitter
Who loves you?
Social/text/sentiment
64. Consider the humble tweet…
As, indeed, Sally Bercow should
have done *Innocent Face*
65. Consider the humble tweet…
I’d just like to apologise for that
last slide but I would point out
that it “contained no accusation
whatsoever … Mischievous but
not libellous.”
66. Case Study
Oil Rig data
Gone fishing
Sensor data
67. Lessons learned
• Engagement
• Choose your battles – look for an area where you
can gain competitive advantage
• Choose your platform carefully
• Programming – algorithm development
• Data scientists
• Custom algorithms
• Custom visualisations
68. Thank you very much for listening
Any Questions?
Mark Whitehorn
(MarkWhitehorn@computing.dundee.ac.uk)
69. BIG DATA - AS OPPOSED
TO SMALL DATA
60 minutes
Mark Whitehorn
Editor's Notes
Each scan in the data file requires a lot of highly intensive processing in order to determine what proteins were present in the cell. Some examples… Currently a single-threaded, PC-based application is used.