Table of Contents
• Definitions
• Big Data 3V's
• Internet Stats
• Applications & Examples
• Data Science Areas
• Identities and Skills
• Data Work Flow
• Challenges
• Data Generation
• Data Structure
• Cloud Service Providers
• Hadoop Ecosystem
• Data Visualization
• Data Analytics Methods
• Data Trends
• Programming Languages
• NoSQL Databases
• Interesting Facts
• Interesting Insights
• Data Sources
• Keywords & Glossary
• References
Big Data - Definitions
1. The first documented use of the term “big data” appeared in a 1997 paper
by scientists at NASA, describing the problem they had with visualization
(i.e. computer graphics) which “provides an interesting challenge for
computer systems: data sets are generally quite large, taxing the capacities
of main memory, local disk, and even remote disk. We call this the problem
of big data. When data sets do not fit in main memory (in core), or when
they do not fit even on local disk, the most common solution is to acquire
more resources.” (NASA)
2. Data of a very large size, typically to the extent that its manipulation and
management present significant logistical challenges. (Oxford English
Dictionary)
3. Big data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage, and process
data within a tolerable elapsed time. (Wikipedia)
Big Data – Basic 3V’s
Big Data: Extremely large data
sets that may be analyzed
computationally to reveal
patterns, trends, and
associations, especially relating to
human behavior and interactions.
(Google)
1. Volume: Huge amount of data (Terabytes of Records, Transactions,
Tables, Files)
2. Velocity: High rate of data and information flowing into and out of
our systems (Batch, Real-time, Streams, Near-time)
3. Variety: Complexity, thousands or more features per data item
(Structured, Unstructured, Semi-Structured)
Big Data – More V’s
• Veracity: Accuracy and uncertainty of data
• Validity: Data quality, clean/unclean data
• Variability: Constantly changing/dynamic data
• Value: The potential business value/ROI of data
• Venue: Distributed, heterogeneous data from multiple platforms
• Vocabulary: Schema, data models, semantics, ontologies,
taxonomies, context based
• Vagueness: Confusion over the meaning of data
• Visibility: Open/Secure data
• Visualization: Presentation of data in a readable and accessible way
Big Data – Moore’s Law
The number of transistors on a chip, and with it the physical capacity and performance of computers, doubles about every two years.
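To get a feel for what doubling every two years implies, the sketch below computes the cumulative growth factor. The function name `growth_factor` and the fixed two-year period are assumptions made for illustration; real hardware progress is less regular.

```python
def growth_factor(years: float, doubling_period: float = 2.0) -> float:
    """Return the multiple by which capacity grows after `years`."""
    return 2 ** (years / doubling_period)


for years in (2, 10, 20):
    print(f"After {years:>2} years: about {growth_factor(years):,.0f}x")
```

Under this assumption, ten years gives a 32-fold increase and twenty years roughly a thousand-fold increase.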
Big Data – Internet Stats
• Data volumes are exploding: more data has been created in the past two
years than in the entire previous history of the human race.
• Data is growing faster than ever before and by the year 2020, about 1.7
megabytes of new information will be created every second for every human
being on the planet.
• By then, our accumulated digital universe of data will grow from 4.4
zettabytes today (1 zettabyte = 10^21 bytes) to around 44 zettabytes, or 44 trillion gigabytes.
• Every second we create new data. For example, we perform 40,000 search
queries every second (on Google alone), which makes it 3.5 billion searches
per day and 1.2 trillion searches per year.
• In Aug 2015, over 1 billion people used Facebook in a single day.
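The search figures above can be checked with simple arithmetic. This is only a sanity check of the quoted rate of 40,000 queries per second; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

queries_per_second = 40_000
queries_per_day = queries_per_second * SECONDS_PER_DAY
queries_per_year = queries_per_day * 365

print(f"Per day:  {queries_per_day / 1e9:.2f} billion")
print(f"Per year: {queries_per_year / 1e12:.2f} trillion")
```

The results, about 3.46 billion per day and 1.26 trillion per year, agree with the rounded figures on the slide.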
Big Data – Internet Stats – Continued
• Facebook users send on average 31.25 million messages and view 2.77
million videos every minute.
• We are seeing a massive growth in video and photo data, where every
minute up to 300 hours of video are uploaded to YouTube alone.
• In 2015, a staggering 1 trillion photos will be taken and billions of them will be
shared online. By 2017, nearly 80% of photos will be taken on smart phones.
• This year, over 1.4 billion smart phones will be shipped – all packed with
sensors capable of collecting all kinds of data, not to mention the data the
users create themselves.
• By 2020, we will have over 6.1 billion smartphone users globally (overtaking
basic fixed phone subscriptions).
Internet Stats - Continued
• Within five years there will be over 50 billion smart connected devices in the world,
all developed to collect, analyze and share data.
• By 2020, at least a third of all data will pass through the cloud (a network of servers
connected over the Internet).
• Distributed computing (performing computing tasks using a network of computers in
the cloud) is very real. Google uses it every day to involve about
1,000 computers in answering a single search query, which takes no more than 0.2
seconds to complete.
• The Hadoop (open source software for distributed computing) market is forecast to
grow at a compound annual growth rate of 58%, surpassing $1 billion by 2020.
• Estimates suggest that by better integrating big data, healthcare could save as much
as $300 billion a year — that’s equal to reducing costs by $1000 a year for every
man, woman, and child.
Internet Stats - Continued
• The White House has already invested more than $200 million in big data projects.
• For a typical Fortune 1000 company, just a 10% increase in data accessibility will
result in more than $65 million additional net income.
• Retailers who leverage the full power of big data could increase their operating
margins by as much as 60%.
• 73% of organizations have already invested or plan to invest in big data by 2016.
• Favorite fact: At the moment less than 0.5% of all data is ever analyzed and used.
Just imagine the potential here.
• More stats: http://www.internetlivestats.com
Big Data – Consumer Applications
• Google Search!
• iPhone Siri
• Microsoft Cortana
• Amazon Suggestions
• Spotify Suggestions
• Yelp Recommendations
• Netflix Recommendations
• Google Now!
Big Data – Business Applications
• Google Ads Searches: Showing relevant ads to users
• Predictive Marketing: consumer behavior, user demographic info
• Banking: Fraud detection, risk reporting, customer data analysis
• Financial: Stock prediction, Forex
• Fraud Detection: spam filtering, online payments
• Health: self-aware medics, sports analysis, genomics, health records
• Smart Cities: IoT, transportation, traffic, governance, energy, economy
• Social Media: friend, topic, and video recommendations
• Education: LMS tracks & logs, time spent on subjects
Big Data – Research Applications
• Google Trends: Flu, Zika & Ebola virus, racial justice, supporting refugees
and migrant crisis
• National Institutes of Health: BRAIN Initiative (Brain Research through
Advancing Innovative Neurotechnologies) to create a full map of brain function
• NASA: Kepler space telescope searching for exoplanets (planets outside
our solar system)
• Facebook Graphs: Revealing relationships, six degrees of separation,
psychological and personality data
• Google Books: Ngram Viewer, History of words, their usage, different
meanings
Big Data – Example 1 – UPS Post
• Insight: Optimize the routing again, predict the
maintenance requirements of vehicles.
• System: ORION database: engine performance,
speed, number of stops, mileage, miles per gallon,
GPS, driver behavior, safety habits, emissions,
fuel consumption, deliveries, customers,
addresses, routes. 250 million+ data points.
• Analysis: Advanced mathematical models that
provide additional optimization and navigational
capabilities to make drivers more efficient.
• Result: Saved over 39 million gallons of fuel,
avoided 364 million miles, reduced engine idle
time by 10 million minutes.
Big Data – Example 2 – Walmart
• Insight: Customers stock up on certain products in
the days leading up to predicted hurricanes.
• System: Retail Link system records sales, triggers
reordering, scheduling, and delivery. Back-office
scanners track shipments. Partners use RFID
technology to track and coordinate inventories. Data
includes daily sales, shipments, returns, purchase
orders, invoices.
• Analysis: Mines data to get its product mix right
under all sorts of varying environmental conditions.
• Result: Revenues greater than any firm in the US.
RFID boosted sales 20%. Gillette increased sales
19%.
Big Data – Example 3 – Fraud at eBay
• Insight: Fraud spikes mid-week, enabling
fraudsters to receive goods by the weekend. Basic
fraud pattern = long-distance, high-dollar,
expedited shipping.
• System: Names, email, addresses, device
fingerprinting, IP address, geolocation lookups,
time zones, countries in Oracle database of 1.3
billion entries.
• Analysis: Run transactions against 600 rules and
20-plus machine learning algorithms. Regularly tweak
the fraud rules.
• Result: In 2014, prevented $55 million worth of
fraudulent transactions.
Big Data – Example 4 – Kaiser Permanente
• Insight: Kaiser Permanente: HealthConnect exchanges data across all facilities,
promotes electronic records. Improved outcomes in
cardiovascular disease and saved $1 billion from
reduced office visits and lab tests.
• System: Pharmaceutical companies have
aggregated years of research and development data
into medical databases, payors and providers have
digitized patient records, public stakeholders have
opened data from clinical trials. 4 billion petabytes.
• Analysis: Determine whether standard protocol for a
disease produces optimal results.
• Result: $300 billion to $450 billion in reduced health-
care spending.
Big Data vs Small Data
Aspect | Small Data | Big Data
Goals | Has a specific goal | May have a goal
Location | On a single computer | On the cloud (multiple servers)
Structure | Highly structured | Semi-structured/unstructured
File Types | SQL, Excel | Documents, multimedia, graphs, tables
Data Preparation | Prepared by one user | Prepared, analyzed, used by different groups of users
Longevity | Short time period | Continues for a long time
Measurements | Single unit (cm) | Multiple units (cm, inch,…)
Reproducibility | Usually reproducible | Rarely reproducible
Lost Costs | Limited | Huge amount
Introspection | Clear meaning | Complex meaning, meaningless
Analysis | Can be analyzed at once | Needs an analysis procedure
Big Data – Professional Identities
• Data Developer: Developer, Engineer
• Data Researcher: Researcher, Scientist, Statistician
• Data Creative: Jack of All Trades, Artist, Hacker
• Data Businessperson: Leader, Businessperson, Entrepreneur
Big Data – Five Skill Groups
• Business: Product Development, Business
• ML / Big Data: Unstructured Data, Structured Data, Machine Learning, Big and Distributed Data
• Math / OR: Optimization, Math, Graphical Models, Bayesian / Monte Carlo Statistics, Algorithms, Simulation
• Programming: System Administration, Back-End Programming, Front-End Programming
• Statistics: Visualization, Temporal Statistics, Surveys and Marketing, Spatial Statistics, Science, Data Manipulation, Classical Statistics
Big Data – Scientific Data
• Genetic Data (1V): High Volume of data in a structured way
• Earthquake Prediction (1V): High Velocity of data, almost real-time
• Facial Recognition (1V): High Variety of data
• Jet Engine Sensors (2Vs): High Volume + High Velocity (20 TB of data
per hour)
• Surveillance Video (2Vs): High Velocity + High Variety of data
streaming
• Google Books (2Vs): High Volume + High Variety of data (30 Million
books)
Big Data – Common Challenges
• Anonymity: danger of de-anonymizing public data, social network graphs,
medical data,…
• Confidentiality: trying to protect data and access levels, storing unimportant
data and its responsibility
• Data Quality: Nearly 95% of spreadsheets have errors
• Incomplete or corrupted data
• Duplicate records
• Typographical errors
• Data without context/missing context
• Incomplete transformations
• Data conversion errors
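Several of the data quality problems listed above (incomplete records, duplicates, stray whitespace and casing) can be caught with a small cleaning pass. The sketch below is a minimal illustration on made-up records; the field names and the rule that an email address identifies a record are assumptions.

```python
def clean_records(records):
    """Drop incomplete rows and duplicates that differ only in case or spacing."""
    seen, cleaned = set(), []
    for rec in records:
        name = (rec.get("name") or "").strip()
        email = (rec.get("email") or "").strip().lower()
        if not name or not email:   # incomplete record
            continue
        if email in seen:           # duplicate record
            continue
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


# Hypothetical input with one duplicate and one incomplete row.
records = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada lovelace ", "email": "ADA@example.com "},
    {"name": "Alan Turing", "email": None},
]
cleaned = clean_records(records)
print(cleaned)
```

Only the first record survives: the second is a duplicate after normalization and the third is incomplete.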
Big Data – Security Challenges
• Secure computations in distributed programming frameworks
• Security best practices for non-relational data stores
• Secure data storage and transaction logs
• End-point input validation/filtering
• Real-time security/compliance monitoring
• Scalable and composable privacy-preserving data mining and analytics
• Cryptographically enforced access control and secure communication
• Granular access control
• Granular audits
• Data provenance
Big Data – Human Generated Data
• Intentional Data: Chats, photos, videos, comments, likes, web
searches, emails, cell phone calls, text messages, online purchases,…
• Meta Data: Data about data, second order data
• Photo metadata taken by cameras
• Cell phones time and location
• Emails To, From, CC, BCC
• Social network connections
• Twitter collects 150 pieces of metadata for each tweet
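Email headers are a simple example of metadata that travels with the intentional content. The sketch below uses Python's standard `email` module to separate the two; the message itself is made up for illustration.

```python
from email import message_from_string

# Hypothetical message: headers are metadata, the body is the intentional data.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Cc: carol@example.com\n"
    "Subject: Lunch?\n"
    "\n"
    "See you at noon."
)

msg = message_from_string(raw)
metadata = {header: msg[header] for header in ("From", "To", "Cc")}
body = msg.get_payload()

print(metadata)
print(body)
```

Even without reading the body, the headers reveal who is communicating with whom.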
Big Data – iPhone 4S Photo EXIF Metadata
ExifTool Version Number : 8.68
File Name : IMG_1031.JPG
Directory : .
File Size : 3.1 MB
File Modification Date/Time : 2011:10:05 01:43:44-07:00
File Permissions : rw-r--r--
File Type : JPEG
MIME Type : image/jpeg
Exif Byte Order : Big-endian (Motorola, MM)
Make : Apple
Camera Model Name : iPhone 4S
Orientation : Rotate 180
X Resolution : 72
Y Resolution : 72
Resolution Unit : inches
Software : 5.0
Modify Date : 2011:08:24 13:13:33
Y Cb Cr Positioning : Centered
Exposure Time : 1/286
F Number : 2.4
Exposure Program : Program AE
ISO : 64
Exif Version : 0221
Date/Time Original : 2011:08:24 13:13:33
Create Date : 2011:08:24 13:13:33
Components Configuration : Y, Cb, Cr, -
Shutter Speed Value : 1/286
Aperture Value : 2.4
Brightness Value : 6.992671928
Metering Mode : Multi-segment
Flash : Auto, Did not fire
Focal Length : 4.3 mm
Subject Area : 1631 1223 881 881
Flashpix Version : 0100
Color Space : sRGB
Exif Image Width : 3264
Exif Image Height : 2448
Sensing Method : One-chip color area
Exposure Mode : Auto
White Balance : Auto
Focal Length In 35mm Format : 35 mm
Scene Capture Type : Standard
Sharpness : Normal
GPS Latitude Ref : North
GPS Longitude Ref : West
GPS Altitude Ref : Above Sea Level
GPS Time Stamp : 21:08:30
GPS Img Direction Ref : True North
GPS Img Direction : 346.4727273
Compression : JPEG (old-style)
Thumbnail Offset : 908
Thumbnail Length : 12311
Image Width : 3264
Image Height : 2448
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:2:0 (2 2)
Aperture : 2.4
GPS Altitude : 1222 m Above Sea Level
GPS Latitude : 37 deg 44' 10.80" N
GPS Longitude : 119 deg 35' 58.80" W
GPS Position : 37 deg 44' 10.80" N, 119 deg 35' 58.80" W
Image Size : 3264x2448
Scale Factor To 35 mm Equivalent : 8.2
Shutter Speed : 1/286
Thumbnail Image : (Binary data 12311 bytes, use -b option to extract)
Circle Of Confusion : 0.004 mm
Field Of View : 54.4 deg
Focal Length : 4.3 mm (35 mm equivalent: 35.0 mm)
Hyperfocal Distance : 2.08 m
Light Value : 11.3
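The GPS fields in this metadata are enough to place the photo on a map. As a sketch, the helper below converts the degrees/minutes/seconds strings shown above into signed decimal degrees. The function name `dms_to_decimal` and the exact string format it accepts are assumptions for illustration.

```python
import re


def dms_to_decimal(text: str) -> float:
    """Convert a string like `37 deg 44' 10.80" N` to signed decimal degrees."""
    match = re.fullmatch(r"""\s*(\d+) deg (\d+)' ([\d.]+)"\s*([NSEW])\s*""", text)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees, minutes, seconds, ref = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and West are negative by convention.
    return -value if ref in "SW" else value


latitude = dms_to_decimal("37 deg 44' 10.80\" N")
longitude = dms_to_decimal("119 deg 35' 58.80\" W")
print(f"{latitude:.6f}, {longitude:.6f}")
```

The result is a latitude/longitude pair that any mapping service accepts, which is why location metadata in shared photos is a privacy concern.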
Big Data – Computer Generated Data
• Sources: Cell phones connecting to towers, Satellite radio, GPS
connecting, Wi-Fi connections, Web Crawlers,…
• Internet of Things (IoT): Information collected and transmitted via IoT
devices, Production Lines, Smart Meters, Environmental Monitoring,
Industrial Applications, Infrastructure Management, Energy
Management, Medical and Healthcare Systems, Smart Buildings,…
• Machine to Machine: Server to Server connections, Web Services,
Cloud Computations, Real-Time Analytics, Network Monitoring,
Routing and Switching,…
Big Data – Structured vs. Unstructured Data
Features | Structured Data | Unstructured Data
Representation | Discrete rows and columns | Less defined boundaries and easily addressable
Storage | Relational databases or spreadsheets | Unmanaged file structures
Metadata | Syntax | Semantics
Integration Tools | ETL or ELT | Batch processing or manual data entry that involves codes
Standard | SQL, ADO.NET, ODBC,... | OpenXML, JSON, SMTP, SMS, CSV,...
Databases | MSSQL, Oracle, Excel,… | Hadoop, HDInsight, MongoDB,…
Content | Typically text | Text, images, audio, video, documents
Big Data – Cloud Computing Services
[Figure: cloud service stack, from top to bottom: SaaS / DaaS, PaaS, IaaS]
Big Data – Cloud Computing Services Continued
• IaaS: Infrastructure as a Service
• Servers, Virtual Machines, Storage, Load Balancers, Firewalls, Network
• PaaS: Platform as a Service
• Web Servers, Databases, Development Tools, Execution Runtime
• SaaS: Software as a Service
• CRM, ERP, Email, Virtual Desktop, Communications, Games
• DaaS: Data as a Service (Free or Commercial)
• Stocks, Forex, Google Maps, Reddit, Twitter Demographic Data
Big Data – Cloud Service Providers
• Google Big Data Solutions
• Amazon Public Elastic Cloud
• Microsoft Azure
• OpenStack by Rackspace and NASA
• IBM Big Data Solutions
• Cloudera
• Oracle Cloud Platform
• Hortonworks
• SAP Big Data
Big Data – Hadoop
• Apache Hadoop (pronunciation: /həˈduːp/) is an open-source software
framework for distributed storage and distributed processing of very large
data sets on computer clusters built from commodity hardware. All the
modules in Hadoop are designed with a fundamental assumption that
hardware failures are common and should be automatically handled by the
framework. (Wikipedia)
• History: Doug Cutting, Mike Cafarella, and team took the approach Google had
described in its published papers and started an open-source project called
Hadoop in 2005. Doug named it after his son's toy elephant. Apache Hadoop is
now a registered trademark of the Apache Software Foundation.
• Hadoop is a Free and Open Source Project
Big Data – Hadoop Components
• HDFS: The Hadoop Distributed File System, used to store files across
many computers
• MapReduce:
• Map splits a task into pieces
• Reduce combines the output
• In Hadoop 2, cluster resource management moved from MapReduce to YARN
(the result is also known as MapReduce 2)
• YARN: Lets the cluster run batch processing like MapReduce, and also stream
processing and graph processing engines, which classic MapReduce could not
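The map and reduce steps can be illustrated with the classic word count. The sketch below runs in a single Python process and does not use the Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` are chosen for illustration. In a real cluster, the map calls run in parallel on different machines and the framework performs the grouping step.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data is everywhere"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Because each map call sees only its own document and each reduce call sees only one key, both phases can be spread across many machines.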
Big Data – Hadoop Components Continued
• Pig: Generates MapReduce programs from scripts written in the Pig Latin
language
• Hive: Summarizes, queries, and analyzes data using the HiveQL
query language
• HBase: A NoSQL (non-relational, “not only SQL”) database
• Storm: Real-time processing of streaming data
• Spark: In-memory processing (working in RAM rather than on disk)
• Giraph: Graph processing for social network data
Big Data – Landscape
http://www.hzahed.com/post/big-data-landscape
Big Data – Who Uses Hadoop
• Google
• Yahoo!
• LinkedIn
• Facebook
• Quantcast
• Amazon
• IBM
• ISI
• Spotify
• Twitter
• Adobe
• eBay
• Alibaba
• Many others
Big Data – ETL Definition
• ETL: Stands for Extract, Transform, Load
• Extract: The process of pulling data from storage such as a database
• Transform: The process of putting data into a common format
• Load: The process of loading data into software for analysis
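The three ETL steps can be sketched with Python's standard library. This is a toy pipeline on made-up data, not a production design: the CSV content, column names, and the in-memory SQLite target are all assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or database.
RAW_CSV = """name,signup_date,spend
 Alice ,2015-08-01,10.50
BOB,2015-08-02,7
"""


def extract(text: str) -> list[dict]:
    """Extract: pull rows out of the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list[dict]) -> list[tuple]:
    """Transform: put every row into one common format."""
    return [
        (r["name"].strip().title(), r["signup_date"], float(r["spend"]))
        for r in rows
    ]


def load(rows: list[tuple]) -> sqlite3.Connection:
    """Load: write the cleaned rows where analysis tools can query them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, signup_date TEXT, spend REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    return conn


conn = load(transform(extract(RAW_CSV)))
total = conn.execute("SELECT SUM(spend) FROM customers").fetchone()[0]
print(total)
```

Keeping the three steps as separate functions makes it easy to inspect the data between stages.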
Big Data – ETL in Hadoop
• ETL in Hadoop works differently from common databases
• Data starts and ends in Hadoop
• Hadoop can handle different formats
• It doesn’t require as much inspection
• No need to be aware of or worry about ETL processes in Hadoop
• Even so, make it a point to inspect the data
Big Data – Monitoring & Anomaly Detection
• Monitoring
• Detects specific events
• Needs specific criteria defined in advance
• Triggers an automatic response
• Anomaly detection
• Notifies of “unusual activity”
• Based on flexible criteria
• Doesn’t trigger a response
• Instead, invites inspection
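The difference between the two approaches can be sketched in a few lines. Here monitoring is a fixed threshold chosen in advance, while anomaly detection flags values that are far from the typical level (a z-score test, which is one common choice among many). The readings, the threshold of 90, and the z-limit of 2 are assumptions for illustration.

```python
from statistics import mean, stdev


def monitor(value: float, limit: float = 90.0) -> bool:
    """Monitoring: a fixed criterion set in advance; True means trigger the response."""
    return value > limit


def find_anomalies(values: list[float], z_limit: float = 2.0) -> list[int]:
    """Anomaly detection: flag readings far from the typical level for inspection."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if sigma and abs(v - mu) / sigma > z_limit]


# Hypothetical CPU load readings in percent.
cpu_load = [41.0, 39.5, 40.2, 42.1, 40.8, 75.0, 41.3, 39.9]
alerts = [i for i, v in enumerate(cpu_load) if monitor(v)]
unusual = find_anomalies(cpu_load)
print(alerts, unusual)
```

The reading of 75 never crosses the monitoring threshold, so no automatic response fires, but it stands out from the rest and is flagged for a person to inspect.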
Big Data – Visualization – Human vs. Computers
• Computers spot certain patterns
• Computers excel at predictive models
• Computers excel at data mining
• Humans perceive and interpret better
• Human vision still plays an important role
• Humans identify visual patterns
• Humans identify anomalies
• Humans see patterns across groups
• Humans interpret content of images better
• Humans perform better on Gestalt tests
Big Data – Visualization – Best Practices
• Prettier graphs are not always better
• Never use a false third dimension
• Animated and interactive graphs can be distracting
• The goal of data visualization is insight
• Use proper chart formats for visualization
• Choose the right color scheme (Qualitative, Sequential, Diverging)
• Make sure the chart alone can tell your story
Big Data – Microsoft Excel Role
• Excel is the most common data tool
• Millions of people use it and know how to deal with it
• Professional data miners use it
• Excel can do real data science on its own
• ODBC interfaces can connect Excel directly to Hadoop
• Excel is great for sharing data results
• Excel includes interactive PivotTables, Sortable Worksheets, Graphics
and Charts
Big Data – Data Analytics (DA) Methods
• Machine Learning (ML)
• Pattern Recognition (PR)
• Data Mining (DM)
• Natural Language Processing (NLP)
• Information Retrieval (IR)
• Text Mining (TM)
• Predictive Analytics
• Business Intelligence (BI)
• Prescriptive Analytics
Big Data – Machine Learning (ML)
• Definition: Machine Learning (ML) is a subfield of computer science
(more particularly soft computing) that evolved from the study of
pattern recognition and computational learning theory in artificial
intelligence. In 1959, Arthur Samuel defined machine learning as a
"Field of study that gives computers the ability to learn without being
explicitly programmed". (Wikipedia)
• Examples: Recommendations, Classification, Linear Regression,
Clustering, Neural Networks
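As a small example of learning from data rather than explicit rules, the sketch below fits a straight line by ordinary least squares and uses it to predict an unseen value. The data points and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 61, 64, 69]
slope, intercept = fit_line(hours, scores)
print(f"predicted score after 6 hours: {slope * 6 + intercept:.1f}")
```

The program is never told the relationship between hours and scores; it estimates the slope and intercept from the examples.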
Big Data – Pattern Recognition (PR)
• Definition: Pattern Recognition (PR) is a branch of machine learning
that focuses on the recognition of patterns and regularities in data,
although it is in some cases considered to be nearly synonymous with
machine learning. Pattern recognition systems are in many cases
trained from labeled "training" data (supervised learning), but when no
labeled data are available other algorithms can be used to discover
previously unknown patterns (unsupervised learning). (Wikipedia)
• Examples: Face detection, fingerprint verification, screening for
tumors and cancers, shape recognition, navigation systems
Big Data – Data Mining (DM)
• Definition: Data Mining (DM) is an interdisciplinary subfield of
computer science. It is the computational process of discovering
patterns in large data sets involving methods at the intersection of
artificial intelligence, machine learning, statistics, and database
systems. The overall goal of the data mining process is to extract
information from a data set and transform it into an understandable
structure for further use. (Wikipedia)
• Examples: Anomaly Detection, Association Rule Learning, Clustering,
Classification, Regression, Summarization
Big Data – Natural Language Processing (NLP)
• Definition: Natural Language Processing (NLP) is a field of computer
science, artificial intelligence, and computational linguistics concerned
with the interactions between computers and human (natural)
languages. As such, NLP is related to the area of human–computer
interaction. (Wikipedia)
• Examples: Natural language understanding, enabling computers to
derive meaning from human or natural language input; and others
involve natural language generation. (SIRI, Cortana)
Big Data – Information Retrieval (IR)
• Definition: Information Retrieval (IR) is the activity of obtaining
information resources relevant to an information need from a collection
of information resources. Searches can be based on metadata or on full-text (or
other content-based) indexing. (Wikipedia)
• Examples: Automated information retrieval systems are used to
reduce what has been called "information overload". Many universities
and public libraries use IR systems to provide access to books,
journals and other documents. Web search engines (Google & Bing)
are the most visible IR applications.
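Full-text indexing is often implemented with an inverted index, which maps each word to the documents that contain it. The sketch below is a minimal version over a made-up collection; real search engines add ranking, stemming, and much more.

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query: str) -> set[str]:
    """Return ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))


# Hypothetical three-document collection.
docs = {
    "d1": "big data needs distributed storage",
    "d2": "distributed computing on commodity hardware",
    "d3": "data visualization best practices",
}
index = build_index(docs)
print(sorted(search(index, "distributed data")))
```

Answering a query only requires looking up each query word, not scanning every document.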
Big Data – Text Mining (TM)
• Definition: Text Mining (TM), also referred to as text data mining and roughly
equivalent to text analytics, refers to the process of deriving high-quality
information from text. High-quality information is typically derived through the
devising of patterns and trends through means such as statistical pattern
learning. Text mining usually involves the process of structuring the input text
(usually parsing, along with the addition of some derived linguistic features
and the removal of others, and subsequent insertion into a database),
deriving patterns within the structured data, and finally evaluation and
interpretation of the output. (Wikipedia)
• Examples: Enterprise Business Intelligence/Data Mining, Competitive
Intelligence, National Security/Intelligence, Publishing, Social Media
Monitoring, Search/Information Access, Natural Language/Semantic Toolkit
or Service, Sentiment Analysis Tools, Listening Platforms
Big Data – Predictive Analytics
• Definition: Predictive Analytics encompasses a variety of statistical
techniques from predictive modeling, machine learning, and data mining that
analyze current and historical facts to make predictions about future or
otherwise unknown events. In business, predictive models exploit patterns
found in historical and transactional data to identify risks and opportunities.
Models capture relationships among many factors to allow assessment of
risk or potential associated with a particular set of conditions, guiding
decision making for candidate transactions. (Wikipedia)
• Examples: Actuarial Science, Marketing, Financial Services, Insurance,
Telecommunications, Retail, Travel, Healthcare, Child Protection,
Pharmaceuticals, Capacity Planning
Big Data – Business Intelligence (BI)
• Definition: Business Intelligence (BI) can be described as "a set of
techniques and tools for the acquisition and transformation of raw data into
meaningful and useful information for business analysis purposes". The term
"data surfacing" is also more often associated with BI functionality. BI
technologies are capable of handling large amounts of unstructured data to
help identify, develop and otherwise create new strategic business
opportunities. The goal of BI is to allow for the easy interpretation of these
large volumes of data. Identifying new opportunities and implementing an
effective strategy based on insights can provide businesses with a
competitive market advantage and long-term stability. (Wikipedia)
• Examples: Measurement, Analytics, Enterprise Reporting, Collaboration
Platform, Knowledge management
Big Data – Prescriptive Analytics
• Definition: Prescriptive analytics is the third and final phase of
business analytics (BA) which includes descriptive, predictive and
prescriptive analytics. Predictive analytics answers the question what
will happen. This is when historical performance data is combined with
rules, algorithms, and occasionally external data to determine the
probable future outcome of an event or the likelihood of a situation
occurring. The final phase is prescriptive analytics, which goes beyond
predicting future outcomes by also suggesting actions to benefit from
the predictions and showing the implications of each decision option.
(Wikipedia)
Big Data – Programming Languages
• R: 49.0%
• SAS: 36.4%
• Python: 35.0%
• SQL: 30.6%
• Java: 12.4%
• Unix shell: 8.8%
• Pig / HiveQL: 8.5%
• SPSS: 8.1%
• MATLAB: 6.3%
Big Data – NoSQL Databases
Database Type | Vendors
Wide Column Store | Hadoop HBase, Cassandra, Hortonworks, Cloudera, Amazon SimpleDB, IBM Informix
Document Store | Elastic, MongoDB, Azure DocumentDB, Terrastore, JSON ODM
Key Value / Tuple Store | Amazon DynamoDB, Azure Table Storage, Oracle NoSQL Database, Genomu
Graph Databases | Neo4J, Infinite Graph, Sparksee, InfoGrid, GraphBase
Multimodel Databases | ArangoDB, OrientDB, RockallDB, FoundationDB
Object Databases | Versant, db4o, Objectivity, Starcounter, Perst, HSS Database, Magma, EyeDB, NDatabase, ObjectDB
Big Data – NoSQL Databases Continued
Database Type | Vendors
Grid & Cloud Database Solutions | Crate Data, Oracle Coherence, GigaSpaces, Infinispan
XML Databases | EMC Documentum xDB, eXist, Sedna, BaseX, QizX, Berkeley DB XML
Multidimensional Databases | Globals, SciDB, MiniM DB, DaggerDB
Multivalue Databases | U2, OpenInsight, Reality, OpenQM, ESENT
Event Sourcing | Event Store, ES4J
Time Series / Streaming Databases | Axibase, Influxdata, kdb+
Other NoSQL Databases | IBM Lotus, eXtremeDB, Yserial, BayesDB, GPUdb, CodernityDB
Big Data – 10 Interesting Facts
1. Every 2 days we create as much information as we did from the beginning
of time until 2003.
2. Over 90% of all the data in the world was created in the past 2 years.
3. It is expected that by 2020 the amount of digital information in existence will
have grown from 3.2 zettabytes today to 40 zettabytes.
4. The total amount of data being captured and stored by industry doubles
every 1.2 years.
5. Every minute we send 204 million emails, generate 1.8 million Facebook
likes, send 278 thousand Tweets, and upload 200 thousand photos to
Facebook.
6. Google alone processes on average over 40 thousand search queries per
second, making it over 3.5 billion in a single day.
7. Around 100 hours of video are uploaded to YouTube every minute and it
would take you around 15 years to watch every video uploaded by users in
one day.
8. Facebook users share 30 billion pieces of content between them every day.
9. AT&T is thought to hold the world’s largest volume of data in one unique
database – its phone records database is 312 terabytes in size, and
contains almost 2 trillion rows.
10. The amount of data transferred over mobile networks increased by 81% to
1.5 exabytes (1.5 billion gigabytes) per month between 2012 and 2014.
Video accounts for 53% of that total.
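The YouTube figure in fact 7 can be checked with quick arithmetic, taking the quoted upload rate of 100 hours per minute as given. The variable names are illustrative.

```python
hours_uploaded_per_minute = 100
hours_uploaded_per_day = hours_uploaded_per_minute * 60 * 24  # 144,000 hours

# Watching nonstop, 24 hours a day, 365 days a year.
years_to_watch = hours_uploaded_per_day / 24 / 365
print(f"{years_to_watch:.1f} years of nonstop viewing")
```

This gives a little over 16 years, the same order of magnitude as the "around 15 years" quoted above.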
Big Data – 10 Interesting Insights
1. “The world is one big data problem.” – Andrew McAfee
2. “In God we trust. All others must bring data.” – W. Edwards Deming
3. “Torture the data, and it will confess to anything.” – Ronald Coase
4. “Information is the oil of the 21st century, and analytics is the
combustion engine.” – Peter Sondergaard
5. “It’s easy to lie with statistics. It’s hard to tell the truth without
statistics.” – Andrejs Dunkels
6. “The goal is to turn data into information, and information into
insight.” – Carly Fiorina
7. “The most valuable commodity I know of is information.” – Gordon
Gekko
8. “Data really powers everything that we do.” – Jeff Weiner
9. “Numbers have an important story to tell. They rely on you to give
them a voice.” – Stephen Few
10. “Data beats emotions.” – Sean Rad
Big Data – Free Data Sources
• Google Trends: www.google.com/trends/explore
• Google Finance: www.google.com/finance
• Google Freebase: developers.google.com/freebase
• Wikipedia Content: en.wikipedia.org/wiki/Wikipedia:Database_download
• U.S. Government Open Data: www.data.gov
• Quandl: www.quandl.com
• World Health Organization: www.who.int/gho/database/en
• Amazon Public Datasets: aws.amazon.com/datasets
• Facebook Graph: developers.facebook.com/docs/graph-api
• UNICEF: www.unicef.org/statistics/
Big Data – Key Terms & Glossary
• Algorithm
• Analytics Platform
• Apache Hive
• Behavioral Analytics
• Big Data Analytics
• Business Intelligence
• Cascading
• Cloud Computing
• Concurrency / Concurrent computing
• Cluster Analysis
• Comparative Analysis
• Connection Analytics
• Correlation Analysis
• Data Analyst
• Data Cleansing
• Data Mining
• Data Model / Data Modeling
• Data Warehouse
• Descriptive Analytics
• ETL
• Hadoop
• Exabyte
• Internet of Things (IoT)
• Machine Learning
• Metadata
• Natural Language Processing
• Pattern Recognition
• Petabyte
• Predictive Analytics
• Prescriptive Analytics
• Semi-structured Data
• Sentiment Analysis
• Terabyte
http://bigdata.teradata.com/US/Big-Data-Quick-Start/Glossary