SlideShare uma empresa Scribd logo
1 de 147
BIG DATA APPLICATIONS & ANALYTICS
MOTIVATION:
BIG DATA AND THE CLOUD;
CENTERPIECES OF THE FUTURE ECONOMY
11/26/2014Course Motivation 1
Geoffrey Fox
November 25 2014
gcf@indiana.edu
http://www.infomall.org
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
Course Motivation
General Remarks including Hype curves
INTRODUCTION
11/26/20142
• There is an endlessly growing amount of data as we record
every transaction between people and the environment
(whether shopping or on a social networking site) while smart
phones, smart homes, ubiquitous cities, smart power grids,
and intelligent vehicles deploy sensors recording even more.
• Science with satellites and accelerators is giving data on
transactions of particles and photons at the microscopic
scale.
• This data are and will be stored in immense clouds with co-
located storage and computing that perform "analytics" that
transform data into information and then to wisdom and
decisions; data mining finds the proverbial knowledge
diamonds in the data rough.
• This disruptive transformation is driving the economy and
creating millions of jobs in the emerging area of "data
science".
• We discuss this revolution and its implications for
universities and society
ABSTRACT
11/26/2014
Course Motivation
3
The Data Deluge is clear trend from Commercial
(Amazon, e-commerce) , Community (Facebook,
Search) and Scientific applications
Smaller (INTEL/ARM/AMD) chips drive
Multicore (i.e. more computing) on shared servers
Smaller Light weight clients from smartphones,
tablets to sensors (i.e. more clients)
Clouds with cheaper, greener, easier to use IT
for applications
New jobs associated with new curricula
Clouds as a distributed system (changing a classic CS
course)
Data Science (new area)
SOME TRENDS
11/26/2014
Course Motivation
4
48 technologies are listed in this year’s hype cycle which is the highest in last ten years.
Year 2008 was the lowest (27)
Gartner Says in 2012: We are at an interesting moment — a time when the scenarios we’ve
been talking about for a long time are almost becoming reality.
11/26/20145
Course Motivation
11/26/20146
Course Motivation
Private Cloud Computing is off
the chart
http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf
11/26/20147
GARTNER EMERGING TECHNOLOGY HYPE
CYCLE 2014
Course Motivation
11/26/20148
Course Motivation
http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf
2013
11/26/20149
Course Motivation
2013
http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf
11/26/201410
Course Motivation
Note number
of “analytics”
areas
http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf
http://www.kpcb.com/internet-trends11/26/201411
Course Motivation
• Economic Imperative: There are a lot of data and
a lot of jobs
• Computing Model: Industry adopted clouds which
are attractive for data analytics
• Research Model: 4th Paradigm; From Theory to
Data driven science?
• Research/Business opportunities in advancing
computing technologies and algorithms
• Research/Business opportunities in X-Informatics:
applying 4th paradigm (more here!)
• Development in Data Science Education:
opportunities at universities
ISSUES OF IMPORTANCE
11/26/2014
Course Motivation
12
Course Motivation
DATA DELUGE
11/26/201413
http://www.kpcb.com/internet-trends
My Research focus is Science Big Data but note
Note largest science ~100 petabytes = 0.000025 total
Note 7 ZB (7. 1021) is about a
terabyte (1012) for each person
in world
Zettabyte ~1010 Typical Local Storage (100 Gigabytes)
Zettabyte = 1000 Exabytes
Exabyte = 1000 Petabytes
Petabyte = 1000 Terabyte
Terabyte = 1000 Gigabytes
Gigabyte = 1000 Megabytes
11/26/201414
Course Motivation
11/26/201415
Course Motivation
http://www.kpcb.com/internet-trends
11/26/201416
Course Motivation
11/26/201417
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
20 hours
https://www.youtube.com/yt/press/statistics.html
http://www.kpcb.com/internet-trends
11/26/201418
Course Motivation
http://cs.metrostate.edu/~sbd/ Oracle11/26/201419
Course Motivation
11/26/201420
Course Motivation
• Web Data (“the original big data”)
• Analyze customer web browsing of e-commerce site to see topics looked at etc.
• Auto Insurance (telematics monitoring driving)
• Equip cars with sensors
• Text data in multiple industries
• Sentiment analysis, identify common issues (as in eBay lamp example), Natural Language
processing
• Time and location (GPS) data
• Track trucks (delivery), vehicles(track), people(tell them nearby goodies)
• Retail and manufacturing: RFID
• Asset and inventory management,
• Utility industry: Smart Grid
• Sensors allow dynamic optimization of power
• Gaming industry: Casino Chip tracking (RFID)
• Track individual players, detect fraud, identify patterns
• Industrial engines and equipment: sensor data
• See GE engine
• Video games: telemetry
• This is like monitoring web browsing but rather monitor actions in a game
• Telecommunication and other industries: Social Network data
• Connections make this big data.
• Use connections to find new customers with similar interests
“TAMING THE BIG DATA TIDAL WAVE” 2012
(BILL FRANKS, CHIEF ANALYTICS OFFICER TERADATA)
11/26/2014
Course Motivation
21
Ruh VP Software GE
http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
11/26/201422
Course Motivation
Ruh VP Software GE
http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
MM = Million
11/26/201423
Course Motivation
• LHC Particle Physics 15 petabytes per year
• Radiology 69 petabytes per year
• Square Kilometer Array Telescope will be 0.5
zettabytes per year raw data in ~2022
• Earth Observation becoming ~4 petabytes per
year
• Earthquake Science – few terabytes total today
• PolarGrid Radar studies of glaciers– 100’s
terabytes/year
• Exascale simulation data dumps – ~0.1 zettabyte
per year
SOME SCIENCE/TECHNICAL DATA SIZES
11/26/2014
Course Motivation
24
http://www.kpcb.com/internet-trends
http://www.genome.gov/sequencingcosts/
Need cost effective
Computing!
Sequence every
newborn by 2019 gives
100 petabytes/year
11/26/201425
Course Motivation
The Long Tail of Science
80-20 rule: 20% users generate 80% data but not necessarily
80% knowledge
Collectively “long tail” science is generating a lot of data
Estimated at over 1PB per year and it is growing fast.
CSTI Meeting. October
2012 Dennis Gannon
11/26/201426
Course Motivation
• Particle Physics LHC (bag of events of particles)
• Information Retrieval or web search (bag of words)
• e-commerce (bag of items with properties or users with
rankings)
• Social Networking (bag of people with links & properties)
• Health Informatics (bag of health records, gene sequences)
• Sensors – web cams, self driving cars etc. (bag of pixels)
• Using
• Statistics (Histograms, Chisq)
• Deep Learning (Machine Learning)
• Image Analysis (including internet uploaded images)
• Recommender Engines (Bag of Ratings or properties)
• Patterns or Anomaly detection in graphs (linked data)
• On Clouds using MapReduce etc.
DATA INTENSIVE ACTIVITIES
Bag=
Space
11/26/2014
Course Motivation
27
• Use Clouds running Data Analytics
Collaboratively processing Big Data
to solve problems in
X-Informatics ( or e-X)
• X = Astronomy, Biology, Biomedicine, Business, Chemistry,
Climate, Crisis, Earth Science, Energy, Environment, Finance,
Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology,
Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and
Wellness with more fields (physics) defined implicitly
• Spans Industry and Science (research)
• Education: Data Science
see recent New York Times articles
• http://datascience101.wordpress.com/2013/04/13/new-
york-times-data-science-articles/
BIG DATA ECOSYSTEM IN ONE SENTENCE
11/26/2014
Course Motivation
28
Social Informatics
Visual&Decision
Informatics
11/26/201429
Course Motivation
Course Motivation
Data Science, Clouds and Computer Science
JOBS
11/26/201430
JOBS V. COUNTRIES
11/26/201431
Course Motivation
http://www.microsoft.com/en-us/news/features/2012/mar12/03-
05CloudComputingJobs.aspx
• There will be a shortage of talent necessary for organizations to take
advantage of big data. By 2018, the United States alone could face a
shortage of 140,000 to 190,000 people with deep analytical skills as well
as 1.5 million managers and analysts with the know-how to use the
analysis of big data to make effective decisions.
• Informatics aimed at 1.5 million jobs. Computer Science covers the
140,000 to 190,000
MCKINSEY INSTITUTE ON BIG DATA JOBS
Course Motivation
3211/26/2014
http://www.mckinsey.com/mgi/publications/
big_data/index.asp.
Tom Davenport Harvard Business School
http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html Nov 2012
11/26/201433
Course Motivation
11/26/201434
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
11/26/201435
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
Course Motivation
Many Technology trends
INDUSTRY TRENDS
11/26/201436
http://www.kpcb.com/internet-trends
Note that translates NOW into smaller
devices
In PAST translated into faster devices of
same form factor
11/26/201437
Course Motivation
11/26/201438
Course Motivation
http://www.kpcb.com/internet-trends
http://www.kpcb.com/internet-trends
11/26/201439
Course Motivation
11/26/201440
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
11/26/201441
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
11/26/201442
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
11/26/201443
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
11/26/201444
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
11/26/201445
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
http://www.kpcb.com/internet-trends
11/26/201446
Course Motivation
http://www.kpcb.com/internet-trends11/26/201447
Course Motivation
11/26/201448
Course Motivation
11/26/201449
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
11/26/201450
Course Motivation
11/26/201451
Course Motivation
Course Motivation
Displaced by Digital Disruption
THE PAST?
11/26/201452
11/26/201453
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
11/26/201454
Course Motivation
http://sephlawless.com/black-friday-2014 No more malls?
11/26/201455
Course Motivation
No more malls?
WHERE ARE SHOPPERS GOING?
11/26/201456
Course Motivation
11/26/201457
Course Motivation
We Are Here 2014-2015
11/26/201458
E-COMMERCE IS DRIVING NEARLY ALL RETAIL GROWTH IN US
Course Motivation
11/26/201459
1 IN 20 RETAIL DOLLARS ARE ALREADY ONLINE
Course Motivation
Even online groceries taking off
11/26/201460
Course Motivation
Course Motivation
Industry adopted clouds which are attractive for
data analytics
COMPUTING MODEL
11/26/201461
For last 5 years Cloud Computing and last 2 years Big Data Transformational
Note in 2013 Big Data moves to 5-10 year slot
11/26/201462
Course Motivation
• It took Amazon Web Services (AWS) eight years to
hit $650 million in revenue, according to Citigroup
in 2010.
• Just three years later, Macquarie Capital analyst
Ben Schachter estimates that AWS will top $3.8
billion in 2013 revenue, up from $2.1 billion in 2012
(estimated), valuing the AWS business at $19
billion.
• First public cloud computing supplier building on
many cloud systems used to run Amazon, Google,
Bing, eBay ….
AMAZON CLOUD AWS MAKING MONEY
11/26/2014
Course Motivation
63
• A bunch of computers in an efficient data center
with an excellent Internet connection
• They were produced to meet need of public-
facing Web 2.0 e-Commerce/Social Networking
sites
• They can be considered as “optimal giant data
center” plus internet connection
• Note enterprises use private clouds that are giant
data centers but not optimized for Internet access
OPERATIONALLY CLOUDS ARE CLEAR
11/26/2014
Course Motivation
64
THE MICROSOFT CLOUD IS BUILT ON DATA CENTERS
11/26/201465
Course Motivation Quincy, WA Chicago, IL San Antonio, TX Dublin, Ireland Generation 4 DCs
Range in size from “edge” facilities to megascale (100K to 1M
servers). Giant data centers with ~ 200-1000 to a shipping
container with Internet access.
~100 Globally Distributed Data Centers
CSTI Meeting. October 2012 Dennis Gannon
DATA CENTERS CLOUDS & ECONOMIES OF SCALE
11/26/201466
Range in size from “edge”
facilities to megascale.
Economies of scale:
Approximate costs for a small
size center (1K servers) and a
larger, 50K server center.
Course Motivation
Each data center is
11.5 times
the size of a football field
Technology Cost in small-
sized Data
Center
Cost in Large
Data Center
Ratio
Network $95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage $2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration ~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
http://research.microsoft.com/en-us/people/barga/sc09_cloudcomp_tutorial.pdf
2 Google warehouses of
computers on the banks of the
Columbia River, in The Dalles,
Oregon
Such centers use 20MW-200MW
(Future) each with 150 watts
per CPU
Save money from large size,
positioning with cheap power
and access with Internet
• Virtualization = abstraction; run a job – you know
not where
• Virtualization = use hypervisor to support “images”
• Allows you to define complete job as an “image” – OS +
application
• Efficient packing of multiple applications into one
server as they don’t interfere (much) with each
other if in different virtual machines;
• They interfere if put as two jobs in same machine
as for example must have same OS and same OS
services
• Also security model between VM’s more robust
than between processes
VIRTUALIZATION MADE SEVERAL THINGS
MORE CONVENIENT
11/26/2014
Course Motivation
67
• http://research.microsoft.com/pubs/78813/AJ18_E
N.pdf
• Typical data center CPU has 9.75% utilization
• Take 5000 SQL servers and rehost on virtual
machines with 6:1 consolidation
MICROSOFT SERVER CONSOLIDATION
60% saving
11/26/2014
Course Motivation
68
• http://www.google.com/green/pdfs/google-
green-computing.pdf
• Clouds win by efficient resource use and efficient
data centers
THE GOOGLE GMAIL EXAMPLE
Business
Type
Number
of users
# servers IT Power
per user
PUE (Power
Usage
effectivene
ss)
Total
Power
per user
Annual
Energy
per user
Small 50 2 8W 2.5 20W 175 kWh
Medium 500 2 1.8W 1.8 3.2W 28.4 kWh
Large 10000 12 0.54W 1.6 0.9W 7.6 kWh
Gmail
(Cloud)
  < 0.22W 1.16 < 0.25W < 2.2 kWh
11/26/2014
Course Motivation
69
• Features from NIST:
• On-demand service (elastic);
• Broad network access;
• Resource pooling;
• Flexible resource allocation;
• Measured service
• Economies of scale in performance and electrical power
(Green IT)
• Powerful new software models
• Platform as a Service is not an alternative to
Infrastructure as a Service – it is instead an incredible
valued added
• Amazon is as much PaaS as Azure
• They are cheaper than classic clusters unless latter 100%
utilized
CLOUDS OFFER FROM DIFFERENT POINTS OF VIEW
11/26/2014
Course Motivation
70
BPM = Business Process management
IaaS Hardware e.g. Server
PaaS Systems Services e.g.
MapReduce, Database
SaaS Applications e.g.
Recommender System
BPaaS Particular
Application Set
11/26/201471
Course Motivation
11/26/201472
Course Motivation
http://www.gartner.com/technology/reprints.do?
id=1-1UKQQA6&ct=140528&st=sb
Gartner Magic
Quadrant for Cloud
Infrastructure as a
Service
AWS in lead
followed by Azure
Course Motivation
4th Paradigm; From Theory to Data driven science?
RESEARCH MODEL
11/26/201473
http://www.wired.com/wired/issue/16-07 September 2008
11/26/201474
Course Motivation
1. Theory
2. Experiment or Observation
• E.g. Newton observed apples falling to design
his theory of mechanics
3. Simulation of theory or model Supercomputers
4. Data-driven (Big Data) or The Fourth Paradigm:
Data-Intensive Scientific Discovery (aka Data
Science)
• http://research.microsoft.com/en-
us/collaboration/fourthparadigm/ A free book
• More data; less models
THE 4 PARADIGMS OF SCIENTIFIC
RESEARCH
11/26/2014
Course Motivation
75
MORE DATA USUALLY BEATS BETTER ALGORITHMS
Course Motivation
7611/26/2014
• Here's how the competition works. Netflix has
provided a large data set that tells you how nearly
half a million people have rated about 18,000 movies.
Based on these ratings, you are asked to predict the
ratings of these users for movies in the set that they
have not rated. The first team to beat the accuracy
of Netflix's proprietary algorithm by a certain margin
wins a prize of $1 million!
• Different student teams in my class adopted different
approaches to the problem, using both published
algorithms and novel ideas. Of these, the results from
two of the teams illustrate a broader point. Team A
came up with a very sophisticated algorithm using
the Netflix data. Team B used a very simple algorithm,
but they added in additional data beyond the Netflix
set: information about movie genres from the Internet
Movie Database(IMDB). Guess which team did
better?
• Anand Rajaraman is Senior Vice President at Walmart
Global eCommerce, where he heads up the newly
created @WalmartLabs,
• http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
• 20120117berkeley1.pdf Jeff Hammerbacher
Course Motivation
DIKW
Data
Information
Knowledge
Wisdom
Decisions
DATA SCIENCE PROCESS
11/26/201477
• Data becomes
• Information becomes
• Knowledge becomes
• Wisdom or Decisions
• Community acceptance of results or approach
important here
• Volume of bits&bytes decreases as we proceed
down DIKW pipeline
DIKW PROCESS
11/26/2014
Course Motivation
78
79
Course Motivation
Database
SS SS SS SS SS SS
SS: Sensor or Data
Interchange
Service
Workflow
through multiple
filter/discovery
clouds
Another
Cloud
Raw Data  Data  Information  Knowledge  Wisdom  Decisions
SSSS
Another
Service
SS
Another
Grid SS
SS
SS
SS
SS
SS
SS
SS
Storage
Cloud
Compute
Cloud
SS
SSSS
SS
Filter
Cloud
Filter
Cloud
Filter
Cloud
Discovery
Cloud
Discovery
Cloud
Filter
Cloud
Filter
Cloud
Filter
Cloud
SS
Filter
Cloud
Filter
Cloud Filter
Cloud
Filter
Cloud
Distributed
Grid
Hadoop
Cluster
SS
Data Deluge is also Information/Knowledge/Wisdom/Decision Deluge?
11/26/2014
• Data comes from traditional maps (US Geological
Survey), Satellites (overlays) and street cams
• Information is presented by basic Google Maps
web page
• Knowledge is a particular optimized route
• Decisions (Wisdom) comes from deciding to drive
a particular route
EXAMPLE OF GOOGLE
MAPS/NAVIGATION
11/26/2014
Course Motivation
80
Course Motivation
Application Example
PHYSICS-INFORMATICS
LOOKING FOR HIGGS PARTICLE
WITH LARGE HADRON COLLIDER LHC
11/26/201481
11/26/201482
Course Motivation
The LHC produces some 15 petabytes of data per year of all varieties and with the exact value
depending on duty factor of accelerator (which is reduced simply to cut electricity cost but also
due to malfunction of one or more of the many complex systems) and experiments. The raw data
produced by experiments is processed on the LHC Computing Grid, which has some 200,000
Cores arranged in a three level structure. Tier-0 is CERN itself, Tier 1 are national facilities and Tier
2 are regional systems. For example one LHC experiment (CMS) has 7 Tier-1 and 50 Tier-2
facilities.
This analysis raw data  reconstructed data  AOD
and TAGS  Physics is performed on the multi-tier
LHC Computing Grid. Note that every event can be
analyzed independently so that many events can be
processed in parallel with some concentration
operations such as those to gather entries in a
histogram. This implies that both Grid and Cloud
solutions work with this type of data with currently
Grids being the only implementation today. Higgs Event
http://grids.ucs.indiana.edu/ptliupages/publications/Where%20does%20all%20the%20data%20come%20from%20v7.pdf
Note LHC lies in a
tunnel 27
kilometres (17 mi)
in circumference
ATLAS Expt
http://www.interactions.org/cms/?pid=1032811
The inside of the RHIC (Relativistic
Heavy Ion Collider) tunnel, a 2.4-mile
high-tech particle racetrack at
Brookhaven National Laboratory.
11/26/201483
Course Motivation
Model
http://www.quantumdiaries.org/2012/09/07/why-particle-detectors-need-a-trigger/atlasmgg/
11/26/201484
Course Motivation
• As a naïve undergraduate in 1964, I was told by Professor who later left
university to enter church that bumps like
were particles. I was amazed and found this more intriguing than
anything else I had heard about so I decided to do PhD in particle
physics.
• I later decided computing moving faster than physics, so I went into
Informatics
• Also I was alarmed by size and time scale of physics activities
• Note ATLAS is 45 metres long, 25 metres in diameter, and weighs about
7,000 tons. The experiment is a collaboration involving roughly 3,000
physicists at 175 institutions in 38 countries
• US version of LHC, Superconducting Super Collider (SSC) discussed in
1983 was cancelled in 1993 after $2B spent
PERSONAL NOTE
11/26/2014
Course Motivation
85
http://www.sciencedirect.
com/science/article/pii/S
037026931200857X
11/26/201486
Course Motivation
Course Motivation
Technology Example
RECOMMENDER SYSTEMS I
11/26/201487
• In many cases, one needs personalized
matching of items to people or perhaps
collections of items to collections of people
• People to products: Online and Offline
Commerce
• People to People: Social Networking
• People to Jobs or Employers: Job Sites
• People+Queries to the Web: Information
Retrieval (search as in Bing/Google)
OVERVIEW OF MANY INFORMATICS
AREAS
11/26/2014
Course Motivation
88
• A large number of online and offline commerce activities plus
basic Internet site personalization relies on “recommender
systems”
• Given real-time action by user, immediately suggest new
actions (as in Amazon buy recommendations on web)
• Based on past actions of users (and others) suggest movies to
look at, restaurants to eat at, events to go to, books and
music to buy
• Based on mix of explicit user choice and grouping of internet
sites, present customized Google News page
• Given sales statistics, decide on discounts at “real”
supermarkets and placement of related (by analysis of buying
habits) products
• Identify possible colleagues at Social Networking sites like
LinkedIn
• Identify matches between employers and employees at sites
like CareerBuilder and Monster
RECOMMENDER SYSTEMS IN MORE DETAIL
11/26/2014
Course Motivation
89
• Fit Model to Data
• Higgs + Background
• Match User to Jobs or Books or Other Users?
• Classification is optimizing assignment of members of an
ontology (list of categories) to data
• Typically minimize some function (or maximize negative of
function)
• Interesting feature of these problems is ingenious choice
of function
• Note Physics minimizes (free) energy
• Often involves thinking of people and/or items as points in
a space (not always a traditional vector space)
• Space called “bag” in “bag of words” model for
information retrieval
EVERYTHING IS AN OPTIMIZATION
PROBLEM?
11/26/2014
Course Motivation
90
NETFLIX ON PERSONALIZATION
11/26/201491
Course Motivation
http://www.slideshare.net/xamat/building-largescale-
realworld-recommender-systems-recsys2012-tutorial
NETFLIX ON RECOMMENDATIONS
11/26/201492
Course Motivation
http://www.slideshare.net/xamat/building-largescale-
realworld-recommender-systems-recsys2012-tutorial
APRIL 2013: THE LAST TWO QUARTERS HAVE EACH BROUGHT MORE THAN 2
MILLION NEW STREAMING SUBSCRIBER SIGNUPS. THAT GIVES NETFLIX A
CURRENT TOTAL OF NEARLY 29.2 MILLION SUBSCRIBERS
11/26/201493
Course Motivation
http://www.slideshare.net/xamat/building-largescale-realworld-recommender-
systems-recsys2012-tutorial
http://www.ifi.uzh.ch/ce/teaching/spring2
012/16-Recommender-Systems_Slides.pdf
11/26/201494
Course Motivation
NETFLIX ON DATA SCIENCE
11/26/201495
Course Motivation
Note Netflix and others
run tests all the time on
subsets of customers
http://www.slideshare.net/xamat/building-largescale-realworld-recommender-
systems-recsys2012-tutorial
Course Motivation
Use of Data bags and spaces
RECOMMENDER SYSTEMS II
11/26/201496
• In user-based collaborative filtering, we can think of users
in a space of dimension N where there are N items and M
users.
• Let i run over items and u over users
• Then each user is represented as a vector Ui(u) in “item-
space” where ratings are vector components. We are
looking for users u u’ that are near each other in this space
as measured by some distance between Ui(u) and Ui(u’)
• If u and u’ rate all items then these are “real” vectors but
almost always they each only rates a small fraction of items
and the number in common is even smaller
• The “Pearson coefficient” is just one distance measure that
can be used
• Only sum over i rated by u and u’
DISTANCES IN FUNNY SPACES I
Last.fm uses this for songs as does Amazon, Netflix11/26/2014
Course Motivation
97
• In item-based collaborative filtering, we can think of items
in a space of dimension M where there are N items and M
users.
• Let i run over items and u over users
• Then each item is represented as a vector Ru(i) in “user-
space” where ratings are vector components. We are
looking for items i i’ that are near each other in this space
as measured by some distance between Ru(i) and Ru(i’)
• If i and i’ rated by all users then these are “real” vectors but
almost always they are each only rated by a small fraction
of users and the number in common is even smaller
• The “Cosine measure” is just one distance measure that
can be used
• Only sum over users u rating both i and i’
DISTANCES IN FUNNY SPACES II
11/26/2014
Course Motivation
98
• In content based recommender systems, we can think of
items in a space of dimension M where there are N items
and M properties.
• Let i run over items and p over properties
• Then each item is represented as a vector Pp(i) in
“property-space” where values of properties are vector
components. We are looking for items i i’ that are near
each other in this space as measured by some distance
between Pp(i) and Rp(i’)
• Properties could be “reading age” or “character
highlighted” or “author” for books
• Properties can be genre or artist for songs and video
• Properties can characterize pixel structure for images used
in face recognition, driving etc.
DISTANCES IN FUNNY SPACES III
Pandora uses this for songs (Music Genome) as does Amazon, Netflix
11/26/2014
Course Motivation
99
• Much of (eCommerce/LifeStyle) Informatics involves “points”
• Events in LHC analysis
• Users (people) or items (books, jobs, music, other people)
• These points can be thought of being in a “space” or “bag”
• Set of all books
• Set of all physics reactions
• Set of all Internet users
• However as in recommender systems where a given user only
rates some items, we don’t know “full position”
• However we can nearly always define a useful distance d(a,b)
between points
• Always d(a,b) >= 0
• Usually d(a,b) = d(b,a)
• Rarely d(a,b) + d(b,c) >= d(a,c) Triangle Inequality
DO WE NEED “REAL” SPACES?
11/26/2014
Course Motivation
100
• The simplest way to use distances is “nearest
neighbor algorithms” – given one point, find a set
of points near it – cut off by number of identified
nearby points and/or distance to initial point
• Here point is either user or item
• Another approach is divide space into regions
(topics, latent factors) consisting of nearby points
• This is clustering
• Also other algorithms like Gaussian mixture
models or Latent Semantic Analysis or Latent
Dirichlet Allocation which use a more
sophisticated model
USING DISTANCES
11/26/2014
Course Motivation
101
Course Motivation
Another Example
WEB SEARCH
INFORMATION RETRIEVAL
11/26/2014102
• Get the digital data (from web or from scanning)
• Need to crawl web (? Solved “engineering” problem)
• Preprocess data to get searchable things (words positions)
• Form Inverted Index mapping words to documents
• Typically use TF-IDF (term frequency, Inverse Document
frequency) to quantify importance of word match
• Rank relevance of documents: PageRank
• Lots of technology for advertising, “reverse engineering”
“preventing reverse engineering”
• Clustering of documents into topics (as in Google News)
“WEB DATA ANALYTICS”
11/26/2014
Course Motivation
103
Size of face proportional
to PageRank11/26/2014104
Course Motivation
• Goal (Function to Optimize – Long Term dollars)
• Serve the right item to a user in a given context to
optimize long-term business objectives
• A scientific discipline that involves
• Large scale Machine Learning & Statistics
• Offline Models (capture global & stable characteristics)
• Online Models (incorporates dynamic components)
• Explore/Exploit (active and adaptive experimentation)
• Multi-Objective Optimization
• Click-rates (CTR), Engagement, advertising revenue, diversity,
etc
• Inferring user interest
• Constructing User Profiles
• Natural Language Processing to understand content
• Topics, “aboutness”, entities, follow-up of something, breaking
news,…
MODERN RECOMMENDATION SYSTEMS
(FROM YAHOO)
http://pages.cs.wisc.edu/~beechung/icml11-tutorial/11/26/2014
Course Motivation
105
11/26/2014106
Course Motivation
Recommend applications
Recommend search queries
Recommend news article
Recommend packages:
Image
Title, summary
Links to other pages
Pick 4 out of a pool of K
K = 20 ~ 40
Dynamic
Routes traffic other
pages
http://pages.cs.wisc.edu/~beechung/icml11-tutorial/
• Simple version
• I have a content module on my page, content inventory
is obtained from a third party source which is further
refined through editorial oversight. Can I algorithmically
recommend content on this module? I want to improve
overall click-rate (CTR) on this module
• More advanced
• I got X% lift in CTR. But I have additional information on
other downstream utilities (e.g. advertising revenue).
Can I increase downstream utility without losing too
many clicks?
• Highly advanced
• There are multiple modules running on my webpage.
How do I perform a simultaneous optimization?
SOME EXAMPLES FROM CONTENT
OPTIMIZATION
http://pages.cs.wisc.edu/~beechung/icml11-tutorial/11/26/2014
Course Motivation
107
Course Motivation
Science Clouds
Internet of Things
CLOUD APPLICATIONS IN RESEARCH
11/26/2014108
• Large Scale Supercomputers – Multicore nodes linked by
high performance low latency network
• Increasingly with GPU enhancement
• Suitable for highly parallel simulations
• High Throughput Systems such as European Grid Initiative
EGI or Open Science Grid OSG typically aimed at
pleasingly parallel jobs
• Can use “cycle stealing”
• Classic example is LHC data analysis
• Grids federate resources as in EGI/OSG or enable
convenient access to multiple backend systems including
supercomputers
• Use Services (SaaS)
• Portals make access convenient and
• Workflow integrates multiple processes into a single job
SCIENCE COMPUTING ENVIRONMENTS
11/26/2014
Course Motivation
109
• Synchronization/communication Performance
Grids > Clouds > Classic HPC Systems
• Clouds naturally execute effectively Grid workloads
but are less clear for closely coupled HPC applications
• Classic HPC machines as MPI engines offer highest
possible performance on closely coupled problems
• The 4 forms of MapReduce/MPI
1) Map Only – pleasingly parallel
2) Classic MapReduce as in Hadoop; single Map followed
by reduction with fault tolerant use of disk
3) Iterative MapReduce use for data mining such as
Expectation Maximization in clustering etc.; Cache
data in memory between iterations and support the
large collective communication (Reduce, Scatter,
Gather, Multicast) use in data mining
4) Classic MPI! Support small point to point messaging
efficiently as used in partial differential equation solvers
CLOUDS HPC AND GRIDS
11/26/2014
Course Motivation
110
• Pleasingly (moving to modestly) parallel
applications of all sorts with roughly independent
data or spawning independent simulations
• Long tail of science and integration of
distributed sensors
• Commercial and Science Data analytics that can
use MapReduce (some of such apps) or its
iterative variants (most other data analytics apps)
• Which science applications are using clouds?
• Venus-C (Azure in Europe): 27 applications not using
Scheduler, Workflow or MapReduce (except roll your
own)
• 50% of applications on FutureGrid were from Life Science
• Locally Lilly corporation is commercial cloud user (for
drug discovery) but not IU Biology
• But overall very little science use of clouds yet
WHAT APPLICATIONS WORK IN CLOUDS
11/26/2014
Course Motivation
111
• “Long tail of science” can be an important usage mode of
clouds.
• In some areas like particle physics and astronomy, i.e. “big
science”, there are just a few major instruments generating
now petascale data driving discovery in a coordinated
fashion.
• In other areas such as genomics and environmental
science, there are many “individual” researchers with
distributed collection and analysis of data whose total
data and processing needs can match the size of big
science.
• Clouds can provide scaling convenient resources for this
important aspect of science.
• Can be map only use of MapReduce if different usages
naturally linked e.g. exploring docking of multiple
chemicals or alignment of multiple DNA sequences
• Collecting together (summarizing) multiple “maps” is a Reduction
PARALLELISM OVER USERS AND USAGES
11/26/2014
Course Motivation
112
• It is projected that there will be 24-75 billion devices on the
Internet by 2020. Most will be small sensors that send
streams of information into the cloud where it will be
processed and integrated with other streams and turned
into knowledge that will help our lives in a multitude of small
and big ways.
• The cloud will become increasing important as a controller
of and resource provider for the Internet of Things.
• As well as today’s use for smart phone and gaming console
support, “Intelligent River” “smart homes and grid” and
“ubiquitous cities” build on this vision and we could expect
a growth in cloud supported/controlled robotics.
• Some of these “things” will be supporting science
• Natural parallelism over “things”
• “Things” are distributed and so form a Grid
INTERNET OF THINGS AND THE CLOUD
11/26/2014
Course Motivation
113
• We will look at several streaming examples later
but most of the use cases involve this. Streaming
can be seen in many ways
• There are devices – The Internet of Things and
various MEMS in smartphones
• There are people tweeting or logging in to the
cloud to search. These interactions are
streaming
• Apache Storm is critical software here to gather
and integrate multiple erratic streams
• Also important algorithm challenges to update
quickly analyses with streamed data
STREAMING CATEGORY
http://www.kpcb.com/internet-trends
11/26/2014
Course Motivation
114
11/26/2014115
HOME DEVICES
Course Motivation
11/26/2014116
SENSORS (THINGS) AS A SERVICE
Course Motivation
Sensors as a Service
Sensor
Processing as
a Service
(could use
MapReduce)
A larger sensor ………
Output Sensor
https://sites.google.com/site/opensourceiotcloud/ Open Source Sensor (IoT) Cloud
11/26/2014117
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
11/26/2014118
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
11/26/2014119
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
Course Motivation
Software Ecosystems
PARALLEL COMPUTING AND
MAPREDUCE
11/26/2014120
11/26/2014121
Course Motivation
http://www.slideshare.net/mjft01/big-data-landscape-matt-turck-may-2014
There are a lot of Big Data and HPC Software systems
Challenge! Manage environment offering these different components
11/26/2014122
Course Motivation
• We don’t need 266 software packages so can choose e.g.
• Workflow: IPython, Pegasus or Kepler (replaced by tools like Tez?)
• Data Analytics: Mahout, R, ImageJ, Scalapack
• High level Programming: Hive, Pig
• Parallel Programming model: Hadoop, Spark, Giraph
(Twister4Azure, Harp), MPI; Storm, Kapfka or RabbitMQ (Sensors)
• In-memory: Memcached
• Data Management: Hbase, MongoDB, MySQL or Derby
• Distributed Coordination: Zookeeper
• Cluster Management: Yarn, Slurm
• File Systems: HDFS, Lustre
• DevOps: Cloudmesh, Chef, Puppet, Docker, Cobbler
• IaaS: Amazon, Azure, OpenStack, Libcloud
• Monitoring: Inca, Ganglia, Nagios
MAYBE A BIG DATA INITIATIVE WOULD INCLUDE
11/26/2014
Course Motivation
123
• HPC: Typically SPMD (Single Program Multiple Data) “maps”
typically processing particles or mesh points interspersed with
multitude of low latency messages supported by specialized
networks such as Infiniband and technologies like MPI
• Often run large capability jobs with 100K (going to 1.5M) cores
on same job
• National DoE/NSF/NASA facilities run 100% utilization
• Fault fragile and cannot tolerate “outlier maps” taking longer
than others
• Clouds: MapReduce has asynchronous maps typically
processing data points with results saved to disk. Final reduce
phase integrates results from different maps
• Fault tolerant and does not require map synchronization
• Map only useful special case
• HPC + Clouds: Iterative MapReduce caches results between
“MapReduce” steps and supports SPMD parallel computing with
large messages as seen in parallel kernels (linear algebra) in
clustering and other data mining
CLASSIC PARALLEL COMPUTING
11/26/2014
Course Motivation
124
MAPREDUCE “FILE/DATA REPOSITORY” PARALLELISM
11/26/2014125
Course Motivation
Instruments
Disks Map1 Map2 Map3
Reduce
Communication
Map = (data parallel) computation reading and writing data
Reduce = Collective/Consolidation phase e.g. forming multiple
global sums as in histogram
Portals
/Users
Iterative MapReduce
Map Map Map Map
Reduce Reduce Reduce
SAM’S PROBLEM
HTTP://WWW.SLIDESHARE.NET/ESALIYA/MAPREDUCE-IN-SIMPLE-TERMS
11/26/2014126
Course Motivation
• Sam thought of “drinking” the apple
 He used a to cut the
and a to make juice.
CREATIVE SAM
11/26/2014127
Course Motivation
(<a’, > , <o’, > , <p’, > )
• Implemented a parallel version of his innovation
(<a, > , <o, > , <p, > , …)
Each input to a map is a list of <key, value> pairs
Each output of slice is a list of <key, value> pairs
Grouped by key
Each input to a reduce is a <key, value-list> (possibly a
list of these, depending on the grouping/hashing
mechanism)
e.g. <ao, ( …)>
Reduced into a list of values
The idea of Map Reduce in Data Intensive
Computing
A list of <key, value> pairs mapped into another
list of <key, value> pairs which gets grouped by
the key and reduced into a list of values
Course Motivation
Opportunities at Universities
see New York Times articles
http://datascience101.wordpress.com/2013/04/13/new-york-times-data-science-
articles/
DATA SCIENCE EDUCATION
11/26/2014128
• Broad Range of Topics from Policy to curation to
applications and algorithms, programming
models, data systems, statistics, and broad range
of CS subjects such as Clouds, Programming, HCI,
• Plenty of Jobs and broader range of possibilities
than computational science but similar cosmic
issues
• What type of degree (Certificate, minor, track,
“real” degree)
• What implementation (department,
interdisciplinary group supporting education and
research program)
DATA SCIENCE EDUCATION
11/26/2014
Course Motivation
129
• Have set up certificate and Masters degree in data
science
• Joint between 3 units in School of Informatics and
Computing: Computer Science, Informatics, Information &
Library Science, and Statistics department in COAS College
• Certificate online
• Masters online or residential
• Looking at version with Kelley with Business data analytics
flavor
• Attractive to offer online as few universities have this and so
a potentially large audience outside IU
AT INDIANA UNIVERSITY
11/26/2014
Course Motivation
130
11/26/2014131
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
11/26/2014132
Course Motivation
Meeker/Wu May 29 2013 Internet Trends D11 Conference
• MOOC’s are very “hot” these days with Udacity and
Coursera as start-ups; perhaps over 100,000 participants
• Relevant to Data Science as this is a new field with few
courses at most universities
• Typical model is collection of short prerecorded segments
(talking head over PowerPoint) of length 3-15 minutes
• This is Boredom limit http://blog.coursera.org/post/49750392396/on-
the-topic-of-boredom
• These “lesson objects” can be viewed as “songs”
• Google Course Builder (python open source) builds
customizable MOOC’s as “playlists” of “songs”
• Tells you to capture all material as “lesson objects”
• We are aiming to build a repository of many “songs”; used
in many ways – tutorials, classes …
MASSIVE OPEN ONLINE COURSES (MOOC)
11/26/2014
Course Motivation
133
11/26/2014134
Course Motivation
http://x-informatics.appspot.com/course
One of 2
Original IU
SOIC
MOOCs
11/26/2014135
Seven ~10 minutes lesson objects in this lecture
Started using closed caption but gave up
Could be useful if in other languages
Course Motivation
• We could teach one class to 100,000 students or 2,000
classes to 50 students
• The 2,000 class choice has 2 useful features
• One can use the usual (electronic) mentoring/grading technology
• One can customize each of 2,000 classes for a particular audience
given their level and interests
• One can even allow student to customize – that’s what one does in
making play lists in iTunes
• Both models can be supported by a repository of lesson
objects (10-15 minute video segments) in the cloud
• The teacher can choose from existing lesson objects and
add their own to produce a new customized course with
new lessons contributed back to repository
CUSTOMIZABLE MOOC’S I
11/26/2014
Course Motivation
136
11/26/2014137
Course Motivation
http://iucloudsummerschool.appspot.com/preview
Unit ~1 hour with ~6 lessons,
Total 115 lesson objects
SCIENCE CLOUD MOOC
REPOSITORY
• The 3-15 minute Video over PowerPoint of MOOC
lesson object’s is easy to re-use
• Qiu (IU)and Hayden (ECSU Elizabeth City State
University – (a small HBCU Historically Black
University) will customize a module
• Starting with Qiu’s cloud computing course at IU
• Adding material on use of Cloud Computing in Remote
Sensing (area covered by ECSU course)
• This is a model for adding cloud curricula material
to wide set of universities where faculty not able to
teach
• Defining how to support computing labs
associated with MOOC’s with clouds or VM’s on
clients
• Appliances scale as download to student’s client
CUSTOMIZABLE MOOC’S II
11/26/2014
Course Motivation
138
11/26/2014139
Course Motivation
139
Can of course
build many
different
interfaces
Songs stored
on YouTube
Songs
prepared with
Adobe
Presenter on
Laptop
http://cloudmooc.soic.indiana.edu/
• High volume courses (CS/Ph/Chem/Bio101…)
where scalability of MOOC’s make them attractive
to reach a lot of students
• Niche areas where there is some student interest
but either no faculty expertise or not enough
students to justify traditional courses
• Offer to many institutions simultaneously
TWO LIMITS WHERE MOOC’S ARE
COMPELLING
11/26/2014
Course Motivation
140
• I proposed in 1999 (misjudging pace of online education): One
can place faculty in their favorite location (e.g. remote cave)
and universities provide structure to give “credentials” while
faculty teach
• Maybe some populate repository; others actually deliver courses
• Note Google Course Builder or Microsoft Mix have no need for
local resources except for faculty client (laptop) – entire course
stored in the cloud
• So we can radically change university system with a major cross
institution virtual education and
as well research component
• Enabled by cloud plus high
performance reliable networking
• Lets set up community to define
and build the virtual university
MOOC repository and other
needed activities
HERMIT’S CAVE VIRTUAL UNIVERSITY
11/26/2014
Course Motivation
141
• We should aim at simplicity; attractive at moment is a mix
of multi-topic forum plus more interactive Hangouts or
equivalent (5-12 people)
MENTORING / GRADING
https://www.youtube.com/watch?v=M3jcSCA9_hM
11/26/2014
Course Motivation
142
11/26/2014143
Course Motivation
Meeker/Wu May 29 2013 Internet
Trends D11 Conference
• Use clouds in faculty, graduate student and
undergraduate research
• Clouds for Scientific data analysis
• Core cloud technologies (MapReduce,
Virtualization)
• Teach clouds as it involves areas of Information
Technology with lots of job opportunities
• Use clouds to support distributed learning and
research environment
• A cloud backend for course materials and
collaboration as in MOOC repository
• Green environmentally friendly computing
infrastructure
MANY SYNERGIES CLOUDS AND
UNIVERSITIES
11/26/2014
Course Motivation
144
Course Motivation
CONCLUSIONS
11/26/2014145
• Clouds are here to stay and one should plan on exploiting
them
• Data Intensive studies in business and research continue to
grow in importance
• Data Analytics: Everything is an optimization problem in a funny
space
• Growing employment opportunities in clouds and data
related activities and so popular with students
• Enabling many of the most important companies from
Facebook/Google to General Electric
• Need community discussion of data science education
• Agree on curricula; is such a degree attractive?
• MOOC’s interesting for
• Disseminating new curricula
• Managing course fragments that can be assembled into
custom courses for particular interdisciplinary students
CONCLUSIONS
11/26/2014
Course Motivation
146
• Use Clouds running Data
Analytics Collaboratively
processing Big Data to solve
problems in X-Informatics
educated in data science
• X = Astronomy, Biology, Biomedicine, Business, Chemistry,
Climate, Crisis, Earth Science, Energy, Environment, Finance,
Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology,
Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and
Wellness with more fields (physics) defined implicitly
• Spans Industry and Science (research)
BIG DATA ECOSYSTEM IN ONE SENTENCE
11/26/2014
Course Motivation
147

Mais conteúdo relacionado

Mais procurados

Transforming Big Data into Smart Data for Smart Energy: Deriving Value via ha...
Transforming Big Data into Smart Data for Smart Energy: Deriving Value via ha...Transforming Big Data into Smart Data for Smart Energy: Deriving Value via ha...
Transforming Big Data into Smart Data for Smart Energy: Deriving Value via ha...
Amit Sheth
 
Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...
Amit Sheth
 
DataViz_What_How_Why
DataViz_What_How_WhyDataViz_What_How_Why
DataViz_What_How_Why
Shweta Gupte
 
Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Soci...
Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Soci...Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Soci...
Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Soci...
Artificial Intelligence Institute at UofSC
 

Mais procurados (20)

Data Innovation: Generating Climate Solutions Event
Data Innovation: Generating Climate Solutions EventData Innovation: Generating Climate Solutions Event
Data Innovation: Generating Climate Solutions Event
 
Transforming Big Data into Smart Data for Smart Energy: Deriving Value via ha...
Transforming Big Data into Smart Data for Smart Energy: Deriving Value via ha...Transforming Big Data into Smart Data for Smart Energy: Deriving Value via ha...
Transforming Big Data into Smart Data for Smart Energy: Deriving Value via ha...
 
TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ...
TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ...TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ...
TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ...
 
Artificial Intelligence (AI) and Climate Change
Artificial Intelligence (AI) and Climate ChangeArtificial Intelligence (AI) and Climate Change
Artificial Intelligence (AI) and Climate Change
 
Geospatial Information Management
Geospatial Information ManagementGeospatial Information Management
Geospatial Information Management
 
Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...
 
The GDELT project
The GDELT project The GDELT project
The GDELT project
 
Data Science Perspective and DS demo
Data Science Perspective and DS demo Data Science Perspective and DS demo
Data Science Perspective and DS demo
 
"Where the Internet lives" – performing the material spaces of the "immateria...
"Where the Internet lives" – performing the material spaces of the "immateria..."Where the Internet lives" – performing the material spaces of the "immateria...
"Where the Internet lives" – performing the material spaces of the "immateria...
 
DataViz_What_How_Why
DataViz_What_How_WhyDataViz_What_How_Why
DataViz_What_How_Why
 
EGU 2018 Ian McHarg Lecture
EGU 2018 Ian McHarg LectureEGU 2018 Ian McHarg Lecture
EGU 2018 Ian McHarg Lecture
 
Smart IoT for Connected Manufacturing
Smart IoT for Connected ManufacturingSmart IoT for Connected Manufacturing
Smart IoT for Connected Manufacturing
 
Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Soci...
Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Soci...Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Soci...
Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Soci...
 
Semantics-empowered Smart City applications: today and tomorrow
Semantics-empowered Smart City applications: today and tomorrowSemantics-empowered Smart City applications: today and tomorrow
Semantics-empowered Smart City applications: today and tomorrow
 
UN Global Pulse Annual Report 2017
UN Global Pulse Annual Report 2017UN Global Pulse Annual Report 2017
UN Global Pulse Annual Report 2017
 
(Big) Data Analytics for Environmental Sustainability
(Big) Data Analytics for Environmental Sustainability(Big) Data Analytics for Environmental Sustainability
(Big) Data Analytics for Environmental Sustainability
 
Leveraging technology in disaster management
Leveraging technology in disaster managementLeveraging technology in disaster management
Leveraging technology in disaster management
 
Multipleregression covidmobility and Covid-19 policy recommendation
Multipleregression covidmobility and Covid-19 policy recommendationMultipleregression covidmobility and Covid-19 policy recommendation
Multipleregression covidmobility and Covid-19 policy recommendation
 
Micro reports and Situation Recognition at social machines workshop
Micro reports and Situation Recognition at social machines workshopMicro reports and Situation Recognition at social machines workshop
Micro reports and Situation Recognition at social machines workshop
 
Gender Equality and Big Data. Making Gender Data Visible
Gender Equality and Big Data. Making Gender Data Visible Gender Equality and Big Data. Making Gender Data Visible
Gender Equality and Big Data. Making Gender Data Visible
 

Destaque

Raptor OEM December 2017
Raptor OEM December 2017Raptor OEM December 2017
Raptor OEM December 2017
Mark Donaghy
 
Comunicare de acceptare a ofertelor de vanzari terenuri
Comunicare de acceptare a ofertelor de vanzari terenuriComunicare de acceptare a ofertelor de vanzari terenuri
Comunicare de acceptare a ofertelor de vanzari terenuri
Mîneran Sergiu
 

Destaque (20)

Mind with regard to Meaning
Mind with regard to MeaningMind with regard to Meaning
Mind with regard to Meaning
 
Lets booklet
Lets bookletLets booklet
Lets booklet
 
Mba cm me-introduction
Mba cm me-introductionMba cm me-introduction
Mba cm me-introduction
 
Collaborative design abc2014winter
Collaborative design abc2014winterCollaborative design abc2014winter
Collaborative design abc2014winter
 
Key aggregate cryptosystem for scalable data sharing in cloud storage
Key aggregate cryptosystem for scalable data sharing in cloud storageKey aggregate cryptosystem for scalable data sharing in cloud storage
Key aggregate cryptosystem for scalable data sharing in cloud storage
 
Resume_RB
Resume_RBResume_RB
Resume_RB
 
Raptor OEM December 2017
Raptor OEM December 2017Raptor OEM December 2017
Raptor OEM December 2017
 
Chemistry project on chemistry in everyday life
Chemistry project on chemistry in everyday lifeChemistry project on chemistry in everyday life
Chemistry project on chemistry in everyday life
 
Key determinants of the shadow banking system álvaro álvarez campana
Key determinants of the shadow banking system  álvaro álvarez campanaKey determinants of the shadow banking system  álvaro álvarez campana
Key determinants of the shadow banking system álvaro álvarez campana
 
Key determinants of shadow banking
Key determinants of shadow bankingKey determinants of shadow banking
Key determinants of shadow banking
 
Kmbt35020150406151645 rotated
Kmbt35020150406151645 rotatedKmbt35020150406151645 rotated
Kmbt35020150406151645 rotated
 
Talharii rotated
Talharii rotatedTalharii rotated
Talharii rotated
 
Oferta terenuri
Oferta terenuriOferta terenuri
Oferta terenuri
 
fvff
fvfffvff
fvff
 
Bilant 31.12.2014
Bilant 31.12.2014Bilant 31.12.2014
Bilant 31.12.2014
 
ZIARUL COMUNEI BOCSIG AUGUST 2016
ZIARUL COMUNEI BOCSIG AUGUST 2016ZIARUL COMUNEI BOCSIG AUGUST 2016
ZIARUL COMUNEI BOCSIG AUGUST 2016
 
ziar septembrie
ziar septembrieziar septembrie
ziar septembrie
 
Comunicare de acceptare a ofertelor de vanzari terenuri
Comunicare de acceptare a ofertelor de vanzari terenuriComunicare de acceptare a ofertelor de vanzari terenuri
Comunicare de acceptare a ofertelor de vanzari terenuri
 
Penjualan Paket Tour ke Gunung Sibayak
Penjualan Paket Tour ke Gunung SibayakPenjualan Paket Tour ke Gunung Sibayak
Penjualan Paket Tour ke Gunung Sibayak
 
Terenuri
TerenuriTerenuri
Terenuri
 

Semelhante a data about tech and data trends

Big Data in NATO and Your Role
Big Data in NATO and Your RoleBig Data in NATO and Your Role
Big Data in NATO and Your Role
Jay Gendron
 
BIMCV: The Perfect "Big Data" Storm.
BIMCV: The Perfect "Big Data" Storm. BIMCV: The Perfect "Big Data" Storm.
BIMCV: The Perfect "Big Data" Storm.
maigva
 
Data accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphereData accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphere
Alex Hardisty
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit Dubey
Rohit Dubey
 

Semelhante a data about tech and data trends (20)

Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...
 
INN530 - Assignment 2, Big data and cloud computing for management
INN530 - Assignment 2, Big data and cloud computing for managementINN530 - Assignment 2, Big data and cloud computing for management
INN530 - Assignment 2, Big data and cloud computing for management
 
big_data_casestudies_2.ppt
big_data_casestudies_2.pptbig_data_casestudies_2.ppt
big_data_casestudies_2.ppt
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWG
 
Indexing the Real World Sensor Networks (at RE.WORK Internet of Things Summit...
Indexing the Real World Sensor Networks (at RE.WORK Internet of Things Summit...Indexing the Real World Sensor Networks (at RE.WORK Internet of Things Summit...
Indexing the Real World Sensor Networks (at RE.WORK Internet of Things Summit...
 
Big Data in NATO and Your Role
Big Data in NATO and Your RoleBig Data in NATO and Your Role
Big Data in NATO and Your Role
 
Key Technology Trends for Big Data in Europe
Key Technology Trends for Big Data in EuropeKey Technology Trends for Big Data in Europe
Key Technology Trends for Big Data in Europe
 
Rising tide of data update 20171024
Rising tide of data update 20171024Rising tide of data update 20171024
Rising tide of data update 20171024
 
Rising tide of data update
Rising tide of data update Rising tide of data update
Rising tide of data update
 
Network analysis
Network analysisNetwork analysis
Network analysis
 
Facing data sharing in a heterogeneous research community: lights and shadows...
Facing data sharing in a heterogeneous research community: lights and shadows...Facing data sharing in a heterogeneous research community: lights and shadows...
Facing data sharing in a heterogeneous research community: lights and shadows...
 
Progam slides | December 17, 2013 | Federal Cloud Computing Summit
Progam slides | December 17, 2013 | Federal Cloud Computing SummitProgam slides | December 17, 2013 | Federal Cloud Computing Summit
Progam slides | December 17, 2013 | Federal Cloud Computing Summit
 
BIMCV: The Perfect "Big Data" Storm.
BIMCV: The Perfect "Big Data" Storm. BIMCV: The Perfect "Big Data" Storm.
BIMCV: The Perfect "Big Data" Storm.
 
Data accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphereData accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphere
 
SemWeb 4 Gov – opportunities and challenges
SemWeb 4 Gov – opportunities and challengesSemWeb 4 Gov – opportunities and challenges
SemWeb 4 Gov – opportunities and challenges
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit Dubey
 
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la IglesiaBIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
 
Overview of big data in cloud computing
Overview of big data in cloud computingOverview of big data in cloud computing
Overview of big data in cloud computing
 
Viet stack 2nd meetup - BigData in Cloud Computing
Viet stack 2nd meetup - BigData in Cloud ComputingViet stack 2nd meetup - BigData in Cloud Computing
Viet stack 2nd meetup - BigData in Cloud Computing
 

Último

%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
masabamasaba
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 

Último (20)

WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 

data about tech and data trends

  • 1. BIG DATA APPLICATIONS & ANALYTICS MOTIVATION: BIG DATA AND THE CLOUD; CENTERPIECES OF THE FUTURE ECONOMY 11/26/2014Course Motivation 1 Geoffrey Fox November 25 2014 gcf@indiana.edu http://www.infomall.org School of Informatics and Computing Digital Science Center Indiana University Bloomington
  • 2. Course Motivation General Remarks including Hype curves INTRODUCTION 11/26/20142
  • 3. • There is an endlessly growing amount of data as we record every transaction between people and the environment (whether shopping or on a social networking site) while smart phones, smart homes, ubiquitous cities, smart power grids, and intelligent vehicles deploy sensors recording even more. • Science with satellites and accelerators is giving data on transactions of particles and photons at the microscopic scale. • This data are and will be stored in immense clouds with co- located storage and computing that perform "analytics" that transform data into information and then to wisdom and decisions; data mining finds the proverbial knowledge diamonds in the data rough. • This disruptive transformation is driving the economy and creating millions of jobs in the emerging area of "data science". • We discuss this revolution and its implications for universities and society ABSTRACT 11/26/2014 Course Motivation 3
  • 4. The Data Deluge is clear trend from Commercial (Amazon, e-commerce) , Community (Facebook, Search) and Scientific applications Smaller (INTEL/ARM/AMD) chips drive Multicore (i.e. more computing) on shared servers Smaller Light weight clients from smartphones, tablets to sensors (i.e. more clients) Clouds with cheaper, greener, easier to use IT for applications New jobs associated with new curricula Clouds as a distributed system (changing a classic CS course) Data Science (new area) SOME TRENDS 11/26/2014 Course Motivation 4
  • 5. 48 technologies are listed in this year’s hype cycle which is the highest in last ten years. Year 2008 was the lowest (27) Gartner Says in 2012: We are at an interesting moment — a time when the scenarios we’ve been talking about for a long time are almost becoming reality. 11/26/20145 Course Motivation
  • 6. 11/26/20146 Course Motivation Private Cloud Computing is off the chart http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf
  • 7. 11/26/20147 GARTNER EMERGING TECHNOLOGY HYPE CYCLE 2014 Course Motivation
  • 10. 11/26/201410 Course Motivation Note number of “analytics” areas http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf
  • 12. • Economic Imperative: There are a lot of data and a lot of jobs • Computing Model: Industry adopted clouds which are attractive for data analytics • Research Model: 4th Paradigm; From Theory to Data driven science? • Research/Business opportunities in advancing computing technologies and algorithms • Research/Business opportunities in X-Informatics: applying 4th paradigm (more here!) • Development in Data Science Education: opportunities at universities ISSUES OF IMPORTANCE 11/26/2014 Course Motivation 12
  • 14. http://www.kpcb.com/internet-trends My Research focus is Science Big Data but note Note largest science ~100 petabytes = 0.000025 total Note 7 ZB (7. 1021) is about a terabyte (1012) for each person in world Zettabyte ~1010 Typical Local Storage (100 Gigabytes) Zettabyte = 1000 Exabytes Exabyte = 1000 Petabytes Petabyte = 1000 Terabyte Terabyte = 1000 Gigabytes Gigabyte = 1000 Megabytes 11/26/201414 Course Motivation
  • 17. 11/26/201417 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference 20 hours https://www.youtube.com/yt/press/statistics.html
  • 21. • Web Data (“the original big data”) • Analyze customer web browsing of e-commerce site to see topics looked at etc. • Auto Insurance (telematics monitoring driving) • Equip cars with sensors • Text data in multiple industries • Sentiment analysis, identify common issues (as in eBay lamp example), Natural Language processing • Time and location (GPS) data • Track trucks (delivery), vehicles(track), people(tell them nearby goodies) • Retail and manufacturing: RFID • Asset and inventory management, • Utility industry: Smart Grid • Sensors allow dynamic optimization of power • Gaming industry: Casino Chip tracking (RFID) • Track individual players, detect fraud, identify patterns • Industrial engines and equipment: sensor data • See GE engine • Video games: telemetry • This is like monitoring web browsing but rather monitor actions in a game • Telecommunication and other industries: Social Network data • Connections make this big data. • Use connections to find new customers with similar interests “TAMING THE BIG DATA TIDAL WAVE” 2012 (BILL FRANKS, CHIEF ANALYTICS OFFICER TERADATA) 11/26/2014 Course Motivation 21
  • 22. Ruh VP Software GE http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html 11/26/201422 Course Motivation
  • 23. Ruh VP Software GE http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html MM = Million 11/26/201423 Course Motivation
  • 24. • LHC Particle Physics 15 petabytes per year • Radiology 69 petabytes per year • Square Kilometer Array Telescope will be 0.5 zettabytes per year raw data in ~2022 • Earth Observation becoming ~4 petabytes per year • Earthquake Science – few terabytes total today • PolarGrid Radar studies of glaciers– 100’s terabytes/year • Exascale simulation data dumps – ~0.1 zettabyte per year SOME SCIENCE/TECHNICAL DATA SIZES 11/26/2014 Course Motivation 24
  • 25. http://www.kpcb.com/internet-trends http://www.genome.gov/sequencingcosts/ Need cost effective Computing! Sequence every newborn by 2019 gives 100 petabytes/year 11/26/201425 Course Motivation
  • 26. The Long Tail of Science 80-20 rule: 20% users generate 80% data but not necessarily 80% knowledge Collectively “long tail” science is generating a lot of data Estimated at over 1PB per year and it is growing fast. CSTI Meeting. October 2012 Dennis Gannon 11/26/201426 Course Motivation
  • 27. • Particle Physics LHC (bag of events of particles) • Information Retrieval or web search (bag of words) • e-commerce (bag of items with properties or users with rankings) • Social Networking (bag of people with links & properties) • Health Informatics (bag of health records, gene sequences) • Sensors – web cams, self driving cars etc. (bag of pixels) • Using • Statistics (Histograms, Chisq) • Deep Learning (Machine Learning) • Image Analysis (including internet uploaded images) • Recommender Engines (Bag of Ratings or properties) • Patterns or Anomaly detection in graphs (linked data) • On Clouds using MapReduce etc. DATA INTENSIVE ACTIVITIES Bag= Space 11/26/2014 Course Motivation 27
  • 28. • Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics ( or e-X) • X = Astronomy, Biology, Biomedicine, Business, Chemistry, Climate, Crisis, Earth Science, Energy, Environment, Finance, Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology, Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and Wellness with more fields (physics) defined implicitly • Spans Industry and Science (research) • Education: Data Science see recent New York Times articles • http://datascience101.wordpress.com/2013/04/13/new- york-times-data-science-articles/ BIG DATA ECOSYSTEM IN ONE SENTENCE 11/26/2014 Course Motivation 28
  • 30. Course Motivation Data Science, Clouds and Computer Science JOBS 11/26/201430
  • 31. JOBS V. COUNTRIES 11/26/201431 Course Motivation http://www.microsoft.com/en-us/news/features/2012/mar12/03- 05CloudComputingJobs.aspx
  • 32. • There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions. • Informatics aimed at 1.5 million jobs. Computer Science covers the 140,000 to 190,000 MCKINSEY INSTITUTE ON BIG DATA JOBS Course Motivation 3211/26/2014 http://www.mckinsey.com/mgi/publications/ big_data/index.asp.
  • 33. Tom Davenport Harvard Business School http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html Nov 2012 11/26/201433 Course Motivation
  • 34. 11/26/201434 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 35. 11/26/201435 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 36. Course Motivation Many Technology trends INDUSTRY TRENDS 11/26/201436
  • 37. http://www.kpcb.com/internet-trends Note that translates NOW into smaller devices In PAST translated into faster devices of same form factor 11/26/201437 Course Motivation
  • 40. 11/26/201440 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 41. 11/26/201441 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 42. 11/26/201442 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 43. 11/26/201443 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 44. 11/26/201444 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 45. 11/26/201445 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 49. 11/26/201449 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 52. Course Motivation Displaced by Digital Disruption THE PAST? 11/26/201452
  • 53. 11/26/201453 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 56. WHERE ARE SHOPPERS GOING? 11/26/201456 Course Motivation
  • 58. 11/26/201458 E-COMMERCE IS DRIVING NEARLY ALL RETAIL GROWTH IN US Course Motivation
  • 59. 11/26/201459 1 IN 20 RETAIL DOLLARS ARE ALREADY ONLINE Course Motivation
  • 60. Even online groceries taking off 11/26/201460 Course Motivation
  • 61. Course Motivation Industry adopted clouds which are attractive for data analytics COMPUTING MODEL 11/26/201461
  • 62. For last 5 years Cloud Computing and last 2 years Big Data Transformational Note in 2013 Big Data moves to 5-10 year slot 11/26/201462 Course Motivation
  • 63. • It took Amazon Web Services (AWS) eight years to hit $650 million in revenue, according to Citigroup in 2010. • Just three years later, Macquarie Capital analyst Ben Schachter estimates that AWS will top $3.8 billion in 2013 revenue, up from $2.1 billion in 2012 (estimated), valuing the AWS business at $19 billion. • First public cloud computing supplier building on many cloud systems used to run Amazon, Google, Bing, eBay …. AMAZON CLOUD AWS MAKING MONEY 11/26/2014 Course Motivation 63
  • 64. • A bunch of computers in an efficient data center with an excellent Internet connection • They were produced to meet need of public- facing Web 2.0 e-Commerce/Social Networking sites • They can be considered as “optimal giant data center” plus internet connection • Note enterprises use private clouds that are giant data centers but not optimized for Internet access OPERATIONALLY CLOUDS ARE CLEAR 11/26/2014 Course Motivation 64
  • 65. THE MICROSOFT CLOUD IS BUILT ON DATA CENTERS 11/26/201465 Course Motivation Quincy, WA Chicago, IL San Antonio, TX Dublin, Ireland Generation 4 DCs Range in size from “edge” facilities to megascale (100K to 1M servers). Giant data centers with ~ 200-1000 to a shipping container with Internet access. ~100 Globally Distributed Data Centers CSTI Meeting. October 2012 Dennis Gannon
  • 66. DATA CENTERS CLOUDS & ECONOMIES OF SCALE 11/26/201466 Range in size from “edge” facilities to megascale. Economies of scale: Approximate costs for a small size center (1K servers) and a larger, 50K server center. Course Motivation Each data center is 11.5 times the size of a football field Technology Cost in small- sized Data Center Cost in Large Data Center Ratio Network $95 per Mbps/ month $13 per Mbps/ month 7.1 Storage $2.20 per GB/ month $0.40 per GB/ month 5.7 Administration ~140 servers/ Administrator >1000 Servers/ Administrator 7.1 http://research.microsoft.com/en-us/people/barga/sc09_cloudcomp_tutorial.pdf 2 Google warehouses of computers on the banks of the Columbia River, in The Dalles, Oregon Such centers use 20MW-200MW (Future) each with 150 watts per CPU Save money from large size, positioning with cheap power and access with Internet
  • 67. • Virtualization = abstraction; run a job – you know not where • Virtualization = use hypervisor to support “images” • Allows you to define complete job as an “image” – OS + application • Efficient packing of multiple applications into one server as they don’t interfere (much) with each other if in different virtual machines; • They interfere if put as two jobs in same machine as for example must have same OS and same OS services • Also security model between VM’s more robust than between processes VIRTUALIZATION MADE SEVERAL THINGS MORE CONVENIENT 11/26/2014 Course Motivation 67
  • 68. • http://research.microsoft.com/pubs/78813/AJ18_E N.pdf • Typical data center CPU has 9.75% utilization • Take 5000 SQL servers and rehost on virtual machines with 6:1 consolidation MICROSOFT SERVER CONSOLIDATION 60% saving 11/26/2014 Course Motivation 68
  • 69. • http://www.google.com/green/pdfs/google- green-computing.pdf • Clouds win by efficient resource use and efficient data centers THE GOOGLE GMAIL EXAMPLE Business Type Number of users # servers IT Power per user PUE (Power Usage effectivene ss) Total Power per user Annual Energy per user Small 50 2 8W 2.5 20W 175 kWh Medium 500 2 1.8W 1.8 3.2W 28.4 kWh Large 10000 12 0.54W 1.6 0.9W 7.6 kWh Gmail (Cloud)   < 0.22W 1.16 < 0.25W < 2.2 kWh 11/26/2014 Course Motivation 69
  • 70. • Features from NIST: • On-demand service (elastic); • Broad network access; • Resource pooling; • Flexible resource allocation; • Measured service • Economies of scale in performance and electrical power (Green IT) • Powerful new software models • Platform as a Service is not an alternative to Infrastructure as a Service – it is instead an incredible valued added • Amazon is as much PaaS as Azure • They are cheaper than classic clusters unless latter 100% utilized CLOUDS OFFER FROM DIFFERENT POINTS OF VIEW 11/26/2014 Course Motivation 70
  • 71. BPM = Business Process management IaaS Hardware e.g. Server PaaS Systems Services e.g. MapReduce, Database SaaS Applications e.g. Recommender System BPaaS Particular Application Set 11/26/201471 Course Motivation
  • 73. Course Motivation 4th Paradigm; From Theory to Data driven science? RESEARCH MODEL 11/26/201473
  • 75. 1. Theory 2. Experiment or Observation • E.g. Newton observed apples falling to design his theory of mechanics 3. Simulation of theory or model Supercomputers 4. Data-driven (Big Data) or The Fourth Paradigm: Data-Intensive Scientific Discovery (aka Data Science) • http://research.microsoft.com/en- us/collaboration/fourthparadigm/ A free book • More data; less models THE 4 PARADIGMS OF SCIENTIFIC RESEARCH 11/26/2014 Course Motivation 75
  • 76. MORE DATA USUALLY BEATS BETTER ALGORITHMS Course Motivation 7611/26/2014 • Here's how the competition works. Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated. The first team to beat the accuracy of Netflix's proprietary algorithm by a certain margin wins a prize of $1 million! • Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database(IMDB). Guess which team did better? • Anand Rajaraman is Senior Vice President at Walmart Global eCommerce, where he heads up the newly created @WalmartLabs, • http://anand.typepad.com/datawocky/2008/03/more-data-usual.html • 20120117berkeley1.pdf Jeff Hammerbacher
  • 78. • Data becomes • Information becomes • Knowledge becomes • Wisdom or Decisions • Community acceptance of results or approach important here • Volume of bits&bytes decreases as we proceed down DIKW pipeline DIKW PROCESS 11/26/2014 Course Motivation 78
  • 79. 79 Course Motivation Database SS SS SS SS SS SS SS: Sensor or Data Interchange Service Workflow through multiple filter/discovery clouds Another Cloud Raw Data  Data  Information  Knowledge  Wisdom  Decisions SSSS Another Service SS Another Grid SS SS SS SS SS SS SS SS Storage Cloud Compute Cloud SS SSSS SS Filter Cloud Filter Cloud Filter Cloud Discovery Cloud Discovery Cloud Filter Cloud Filter Cloud Filter Cloud SS Filter Cloud Filter Cloud Filter Cloud Filter Cloud Distributed Grid Hadoop Cluster SS Data Deluge is also Information/Knowledge/Wisdom/Decision Deluge? 11/26/2014
  • 80. • Data comes from traditional maps (US Geological Survey), Satellites (overlays) and street cams • Information is presented by basic Google Maps web page • Knowledge is a particular optimized route • Decisions (Wisdom) comes from deciding to drive a particular route EXAMPLE OF GOOGLE MAPS/NAVIGATION 11/26/2014 Course Motivation 80
  • 81. Course Motivation Application Example PHYSICS-INFORMATICS LOOKING FOR HIGGS PARTICLE WITH LARGE HADRON COLLIDER LHC 11/26/201481
  • 82. 11/26/201482 Course Motivation The LHC produces some 15 petabytes of data per year of all varieties and with the exact value depending on duty factor of accelerator (which is reduced simply to cut electricity cost but also due to malfunction of one or more of the many complex systems) and experiments. The raw data produced by experiments is processed on the LHC Computing Grid, which has some 200,000 Cores arranged in a three level structure. Tier-0 is CERN itself, Tier 1 are national facilities and Tier 2 are regional systems. For example one LHC experiment (CMS) has 7 Tier-1 and 50 Tier-2 facilities. This analysis raw data  reconstructed data  AOD and TAGS  Physics is performed on the multi-tier LHC Computing Grid. Note that every event can be analyzed independently so that many events can be processed in parallel with some concentration operations such as those to gather entries in a histogram. This implies that both Grid and Cloud solutions work with this type of data with currently Grids being the only implementation today. Higgs Event http://grids.ucs.indiana.edu/ptliupages/publications/Where%20does%20all%20the%20data%20come%20from%20v7.pdf Note LHC lies in a tunnel 27 kilometres (17 mi) in circumference ATLAS Expt
  • 83. http://www.interactions.org/cms/?pid=1032811 The inside of the RHIC (Relativistic Heavy Ion Collider) tunnel, a 2.4-mile high-tech particle racetrack at Brookhaven National Laboratory. 11/26/201483 Course Motivation
  • 85. • As a naïve undergraduate in 1964, I was told by Professor who later left university to enter church that bumps like were particles. I was amazed and found this more intriguing than anything else I had heard about so I decided to do PhD in particle physics. • I later decided computing moving faster than physics, so I went into Informatics • Also I was alarmed by size and time scale of physics activities • Note ATLAS is 45 metres long, 25 metres in diameter, and weighs about 7,000 tons. The experiment is a collaboration involving roughly 3,000 physicists at 175 institutions in 38 countries • US version of LHC, Superconducting Super Collider (SSC) discussed in 1983 was cancelled in 1993 after $2B spent PERSONAL NOTE 11/26/2014 Course Motivation 85
  • 88. • In many cases, one needs personalized matching of items to people or perhaps collections of items to collections of people • People to products: Online and Offline Commerce • People to People: Social Networking • People to Jobs or Employers: Job Sites • People+Queries to the Web: Information Retrieval (search as in Bing/Google) OVERVIEW OF MANY INFORMATICS AREAS 11/26/2014 Course Motivation 88
  • 89. • A large number of online and offline commerce activities plus basic Internet site personalization relies on “recommender systems” • Given real-time action by user, immediately suggest new actions (as in Amazon buy recommendations on web) • Based on past actions of users (and others) suggest movies to look at, restaurants to eat at, events to go to, books and music to buy • Based on mix of explicit user choice and grouping of internet sites, present customized Google News page • Given sales statistics, decide on discounts at “real” supermarkets and placement of related (by analysis of buying habits) products • Identify possible colleagues at Social Networking sites like LinkedIn • Identify matches between employers and employees at sites like CareerBuilder and Monster RECOMMENDER SYSTEMS IN MORE DETAIL 11/26/2014 Course Motivation 89
  • 90. • Fit Model to Data • Higgs + Background • Match User to Jobs or Books or Other Users? • Classification is optimizing assignment of members of an ontology (list of categories) to data • Typically minimize some function (or maximize negative of function) • Interesting feature of these problems is ingenious choice of function • Note Physics minimizes (free) energy • Often involves thinking of people and/or items as points in a space (not always a traditional vector space) • Space called “bag” in “bag of words” model for information retrieval EVERYTHING IS AN OPTIMIZATION PROBLEM? 11/26/2014 Course Motivation 90
  • 91. NETFLIX ON PERSONALIZATION 11/26/201491 Course Motivation http://www.slideshare.net/xamat/building-largescale- realworld-recommender-systems-recsys2012-tutorial
  • 92. NETFLIX ON RECOMMENDATIONS 11/26/201492 Course Motivation http://www.slideshare.net/xamat/building-largescale- realworld-recommender-systems-recsys2012-tutorial
  • 93. APRIL 2013: THE LAST TWO QUARTERS HAVE EACH BROUGHT MORE THAN 2 MILLION NEW STREAMING SUBSCRIBER SIGNUPS. THAT GIVES NETFLIX A CURRENT TOTAL OF NEARLY 29.2 MILLION SUBSCRIBERS 11/26/201493 Course Motivation http://www.slideshare.net/xamat/building-largescale-realworld-recommender- systems-recsys2012-tutorial
  • 95. NETFLIX ON DATA SCIENCE 11/26/201495 Course Motivation Note Netflix and others run tests all the time on subsets of customers http://www.slideshare.net/xamat/building-largescale-realworld-recommender- systems-recsys2012-tutorial
  • 96. Course Motivation Use of Data bags and spaces RECOMMENDER SYSTEMS II 11/26/201496
  • 97. • In user-based collaborative filtering, we can think of users in a space of dimension N where there are N items and M users. • Let i run over items and u over users • Then each user is represented as a vector Ui(u) in “item- space” where ratings are vector components. We are looking for users u u’ that are near each other in this space as measured by some distance between Ui(u) and Ui(u’) • If u and u’ rate all items then these are “real” vectors but almost always they each only rates a small fraction of items and the number in common is even smaller • The “Pearson coefficient” is just one distance measure that can be used • Only sum over i rated by u and u’ DISTANCES IN FUNNY SPACES I Last.fm uses this for songs as does Amazon, Netflix11/26/2014 Course Motivation 97
  • 98. • In item-based collaborative filtering, we can think of items in a space of dimension M where there are N items and M users. • Let i run over items and u over users • Then each item is represented as a vector Ru(i) in “user- space” where ratings are vector components. We are looking for items i i’ that are near each other in this space as measured by some distance between Ru(i) and Ru(i’) • If i and i’ rated by all users then these are “real” vectors but almost always they are each only rated by a small fraction of users and the number in common is even smaller • The “Cosine measure” is just one distance measure that can be used • Only sum over users u rating both i and i’ DISTANCES IN FUNNY SPACES II 11/26/2014 Course Motivation 98
  • 99. • In content based recommender systems, we can think of items in a space of dimension M where there are N items and M properties. • Let i run over items and p over properties • Then each item is represented as a vector Pp(i) in “property-space” where values of properties are vector components. We are looking for items i i’ that are near each other in this space as measured by some distance between Pp(i) and Rp(i’) • Properties could be “reading age” or “character highlighted” or “author” for books • Properties can be genre or artist for songs and video • Properties can characterize pixel structure for images used in face recognition, driving etc. DISTANCES IN FUNNY SPACES III Pandora uses this for songs (Music Genome) as does Amazon, Netflix 11/26/2014 Course Motivation 99
  • 100. • Much of (eCommerce/LifeStyle) Informatics involves “points” • Events in LHC analysis • Users (people) or items (books, jobs, music, other people) • These points can be thought of being in a “space” or “bag” • Set of all books • Set of all physics reactions • Set of all Internet users • However as in recommender systems where a given user only rates some items, we don’t know “full position” • However we can nearly always define a useful distance d(a,b) between points • Always d(a,b) >= 0 • Usually d(a,b) = d(b,a) • Rarely d(a,b) + d(b,c) >= d(a,c) Triangle Inequality DO WE NEED “REAL” SPACES? 11/26/2014 Course Motivation 100
  • 101. • The simplest way to use distances is “nearest neighbor algorithms” – given one point, find a set of points near it – cut off by number of identified nearby points and/or distance to initial point • Here point is either user or item • Another approach is divide space into regions (topics, latent factors) consisting of nearby points • This is clustering • Also other algorithms like Gaussian mixture models or Latent Semantic Analysis or Latent Dirichlet Allocation which use a more sophisticated model USING DISTANCES 11/26/2014 Course Motivation 101
  • 102. Course Motivation Another Example WEB SEARCH INFORMATION RETRIEVAL 11/26/2014102
  • 103. • Get the digital data (from web or from scanning) • Need to crawl web (? Solved “engineering” problem) • Preprocess data to get searchable things (words positions) • Form Inverted Index mapping words to documents • Typically use TF-IDF (term frequency, Inverse Document frequency) to quantify importance of word match • Rank relevance of documents: PageRank • Lots of technology for advertising, “reverse engineering” “preventing reverse engineering” • Clustering of documents into topics (as in Google News) “WEB DATA ANALYTICS” 11/26/2014 Course Motivation 103
  • 104. Size of face proportional to PageRank11/26/2014104 Course Motivation
  • 105. • Goal (Function to Optimize – Long Term dollars) • Serve the right item to a user in a given context to optimize long-term business objectives • A scientific discipline that involves • Large scale Machine Learning & Statistics • Offline Models (capture global & stable characteristics) • Online Models (incorporates dynamic components) • Explore/Exploit (active and adaptive experimentation) • Multi-Objective Optimization • Click-rates (CTR), Engagement, advertising revenue, diversity, etc • Inferring user interest • Constructing User Profiles • Natural Language Processing to understand content • Topics, “aboutness”, entities, follow-up of something, breaking news,… MODERN RECOMMENDATION SYSTEMS (FROM YAHOO) http://pages.cs.wisc.edu/~beechung/icml11-tutorial/11/26/2014 Course Motivation 105
  • 106. 11/26/2014106 Course Motivation Recommend applications Recommend search queries Recommend news article Recommend packages: Image Title, summary Links to other pages Pick 4 out of a pool of K K = 20 ~ 40 Dynamic Routes traffic other pages http://pages.cs.wisc.edu/~beechung/icml11-tutorial/
  • 107. • Simple version • I have a content module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve overall click-rate (CTR) on this module • More advanced • I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. advertising revenue). Can I increase downstream utility without losing too many clicks? • Highly advanced • There are multiple modules running on my webpage. How do I perform a simultaneous optimization? SOME EXAMPLES FROM CONTENT OPTIMIZATION http://pages.cs.wisc.edu/~beechung/icml11-tutorial/11/26/2014 Course Motivation 107
  • 108. Course Motivation Science Clouds Internet of Things CLOUD APPLICATIONS IN RESEARCH 11/26/2014108
  • 109. • Large Scale Supercomputers – Multicore nodes linked by high performance low latency network • Increasingly with GPU enhancement • Suitable for highly parallel simulations • High Throughput Systems such as European Grid Initiative EGI or Open Science Grid OSG typically aimed at pleasingly parallel jobs • Can use “cycle stealing” • Classic example is LHC data analysis • Grids federate resources as in EGI/OSG or enable convenient access to multiple backend systems including supercomputers • Use Services (SaaS) • Portals make access convenient and • Workflow integrates multiple processes into a single job SCIENCE COMPUTING ENVIRONMENTS 11/26/2014 Course Motivation 109
  • 110. • Synchronization/communication Performance Grids > Clouds > Classic HPC Systems • Clouds naturally execute effectively Grid workloads but are less clear for closely coupled HPC applications • Classic HPC machines as MPI engines offer highest possible performance on closely coupled problems • The 4 forms of MapReduce/MPI 1) Map Only – pleasingly parallel 2) Classic MapReduce as in Hadoop; single Map followed by reduction with fault tolerant use of disk 3) Iterative MapReduce use for data mining such as Expectation Maximization in clustering etc.; Cache data in memory between iterations and support the large collective communication (Reduce, Scatter, Gather, Multicast) use in data mining 4) Classic MPI! Support small point to point messaging efficiently as used in partial differential equation solvers CLOUDS HPC AND GRIDS 11/26/2014 Course Motivation 110
  • 111. • Pleasingly (moving to modestly) parallel applications of all sorts with roughly independent data or spawning independent simulations • Long tail of science and integration of distributed sensors • Commercial and Science Data analytics that can use MapReduce (some of such apps) or its iterative variants (most other data analytics apps) • Which science applications are using clouds? • Venus-C (Azure in Europe): 27 applications not using Scheduler, Workflow or MapReduce (except roll your own) • 50% of applications on FutureGrid were from Life Science • Locally Lilly corporation is commercial cloud user (for drug discovery) but not IU Biology • But overall very little science use of clouds yet WHAT APPLICATIONS WORK IN CLOUDS 11/26/2014 Course Motivation 111
  • 112. • “Long tail of science” can be an important usage mode of clouds. • In some areas like particle physics and astronomy, i.e. “big science”, there are just a few major instruments generating now petascale data driving discovery in a coordinated fashion. • In other areas such as genomics and environmental science, there are many “individual” researchers with distributed collection and analysis of data whose total data and processing needs can match the size of big science. • Clouds can provide scaling convenient resources for this important aspect of science. • Can be map only use of MapReduce if different usages naturally linked e.g. exploring docking of multiple chemicals or alignment of multiple DNA sequences • Collecting together (summarizing) multiple “maps” is a Reduction PARALLELISM OVER USERS AND USAGES 11/26/2014 Course Motivation 112
  • 113. • It is projected that there will be 24-75 billion devices on the Internet by 2020. Most will be small sensors that send streams of information into the cloud where it will be processed and integrated with other streams and turned into knowledge that will help our lives in a multitude of small and big ways. • The cloud will become increasing important as a controller of and resource provider for the Internet of Things. • As well as today’s use for smart phone and gaming console support, “Intelligent River” “smart homes and grid” and “ubiquitous cities” build on this vision and we could expect a growth in cloud supported/controlled robotics. • Some of these “things” will be supporting science • Natural parallelism over “things” • “Things” are distributed and so form a Grid INTERNET OF THINGS AND THE CLOUD 11/26/2014 Course Motivation 113
  • 114. • We will look at several streaming examples later but most of the use cases involve this. Streaming can be seen in many ways • There are devices – The Internet of Things and various MEMS in smartphones • There are people tweeting or logging in to the cloud to search. These interactions are streaming • Apache Storm is critical software here to gather and integrate multiple erratic streams • Also important algorithm challenges to update quickly analyses with streamed data STREAMING CATEGORY http://www.kpcb.com/internet-trends 11/26/2014 Course Motivation 114
  • 116. 11/26/2014116 SENSORS (THINGS) AS A SERVICE Course Motivation Sensors as a Service Sensor Processing as a Service (could use MapReduce) A larger sensor ……… Output Sensor https://sites.google.com/site/opensourceiotcloud/ Open Source Sensor (IoT) Cloud
  • 117. 11/26/2014117 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 118. 11/26/2014118 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 119. 11/26/2014119 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 120. Course Motivation Software Ecosystems PARALLEL COMPUTING AND MAPREDUCE 11/26/2014120
  • 122. There are a lot of Big Data and HPC Software systems Challenge! Manage environment offering these different components 11/26/2014122 Course Motivation
  • 123. • We don’t need 266 software packages so can choose e.g. • Workflow: IPython, Pegasus or Kepler (replaced by tools like Tez?) • Data Analytics: Mahout, R, ImageJ, Scalapack • High level Programming: Hive, Pig • Parallel Programming model: Hadoop, Spark, Giraph (Twister4Azure, Harp), MPI; Storm, Kapfka or RabbitMQ (Sensors) • In-memory: Memcached • Data Management: Hbase, MongoDB, MySQL or Derby • Distributed Coordination: Zookeeper • Cluster Management: Yarn, Slurm • File Systems: HDFS, Lustre • DevOps: Cloudmesh, Chef, Puppet, Docker, Cobbler • IaaS: Amazon, Azure, OpenStack, Libcloud • Monitoring: Inca, Ganglia, Nagios MAYBE A BIG DATA INITIATIVE WOULD INCLUDE 11/26/2014 Course Motivation 123
  • 124. • HPC: Typically SPMD (Single Program Multiple Data) “maps” typically processing particles or mesh points interspersed with multitude of low latency messages supported by specialized networks such as Infiniband and technologies like MPI • Often run large capability jobs with 100K (going to 1.5M) cores on same job • National DoE/NSF/NASA facilities run 100% utilization • Fault fragile and cannot tolerate “outlier maps” taking longer than others • Clouds: MapReduce has asynchronous maps typically processing data points with results saved to disk. Final reduce phase integrates results from different maps • Fault tolerant and does not require map synchronization • Map only useful special case • HPC + Clouds: Iterative MapReduce caches results between “MapReduce” steps and supports SPMD parallel computing with large messages as seen in parallel kernels (linear algebra) in clustering and other data mining CLASSIC PARALLEL COMPUTING 11/26/2014 Course Motivation 124
  • 125. MAPREDUCE “FILE/DATA REPOSITORY” PARALLELISM 11/26/2014125 Course Motivation Instruments Disks Map1 Map2 Map3 Reduce Communication Map = (data parallel) computation reading and writing data Reduce = Collective/Consolidation phase e.g. forming multiple global sums as in histogram Portals /Users Iterative MapReduce Map Map Map Map Reduce Reduce Reduce
  • 126. SAM’S PROBLEM HTTP://WWW.SLIDESHARE.NET/ESALIYA/MAPREDUCE-IN-SIMPLE-TERMS 11/26/2014126 Course Motivation • Sam thought of “drinking” the apple  He used a to cut the and a to make juice.
  • 127. CREATIVE SAM 11/26/2014127 Course Motivation (<a’, > , <o’, > , <p’, > ) • Implemented a parallel version of his innovation (<a, > , <o, > , <p, > , …) Each input to a map is a list of <key, value> pairs Each output of slice is a list of <key, value> pairs Grouped by key Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism) e.g. <ao, ( …)> Reduced into a list of values The idea of Map Reduce in Data Intensive Computing A list of <key, value> pairs mapped into another list of <key, value> pairs which gets grouped by the key and reduced into a list of values
  • 128. Course Motivation Opportunities at Universities see New York Times articles http://datascience101.wordpress.com/2013/04/13/new-york-times-data-science- articles/ DATA SCIENCE EDUCATION 11/26/2014128
  • 129. • Broad Range of Topics from Policy to curation to applications and algorithms, programming models, data systems, statistics, and broad range of CS subjects such as Clouds, Programming, HCI, • Plenty of Jobs and broader range of possibilities than computational science but similar cosmic issues • What type of degree (Certificate, minor, track, “real” degree) • What implementation (department, interdisciplinary group supporting education and research program) DATA SCIENCE EDUCATION 11/26/2014 Course Motivation 129
  • 130. • Have set up certificate and Masters degree in data science • Joint between 3 units in School of Informatics and Computing: Computer Science, Informatics, Information & Library Science, and Statistics department in COAS College • Certificate online • Masters online or residential • Looking at version with Kelley with Business data analytics flavor • Attractive to offer online as few universities have this and so a potentially large audience outside IU AT INDIANA UNIVERSITY 11/26/2014 Course Motivation 130
  • 131. 11/26/2014131 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 132. 11/26/2014132 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 133. • MOOC’s are very “hot” these days with Udacity and Coursera as start-ups; perhaps over 100,000 participants • Relevant to Data Science as this is a new field with few courses at most universities • Typical model is collection of short prerecorded segments (talking head over PowerPoint) of length 3-15 minutes • This is Boredom limit http://blog.coursera.org/post/49750392396/on- the-topic-of-boredom • These “lesson objects” can be viewed as “songs” • Google Course Builder (python open source) builds customizable MOOC’s as “playlists” of “songs” • Tells you to capture all material as “lesson objects” • We are aiming to build a repository of many “songs”; used in many ways – tutorials, classes … MASSIVE OPEN ONLINE COURSES (MOOC) 11/26/2014 Course Motivation 133
  • 135. 11/26/2014135 Seven ~10 minutes lesson objects in this lecture Started using closed caption but gave up Could be useful if in other languages Course Motivation
  • 136. • We could teach one class to 100,000 students or 2,000 classes to 50 students • The 2,000 class choice has 2 useful features • One can use the usual (electronic) mentoring/grading technology • One can customize each of 2,000 classes for a particular audience given their level and interests • One can even allow student to customize – that’s what one does in making play lists in iTunes • Both models can be supported by a repository of lesson objects (10-15 minute video segments) in the cloud • The teacher can choose from existing lesson objects and add their own to produce a new customized course with new lessons contributed back to repository CUSTOMIZABLE MOOC’S I 11/26/2014 Course Motivation 136
  • 137. 11/26/2014137 Course Motivation http://iucloudsummerschool.appspot.com/preview Unit ~1 hour with ~6 lessons, Total 115 lesson objects SCIENCE CLOUD MOOC REPOSITORY
  • 138. • The 3-15 minute Video over PowerPoint of MOOC lesson object’s is easy to re-use • Qiu (IU)and Hayden (ECSU Elizabeth City State University – (a small HBCU Historically Black University) will customize a module • Starting with Qiu’s cloud computing course at IU • Adding material on use of Cloud Computing in Remote Sensing (area covered by ECSU course) • This is a model for adding cloud curricula material to wide set of universities where faculty not able to teach • Defining how to support computing labs associated with MOOC’s with clouds or VM’s on clients • Appliances scale as download to student’s client CUSTOMIZABLE MOOC’S II 11/26/2014 Course Motivation 138
  • 139. 11/26/2014139 Course Motivation 139 Can of course build many different interfaces Songs stored on YouTube Songs prepared with Adobe Presenter on Laptop http://cloudmooc.soic.indiana.edu/
  • 140. • High volume courses (CS/Ph/Chem/Bio101…) where scalability of MOOC’s make them attractive to reach a lot of students • Niche areas where there is some student interest but either no faculty expertise or not enough students to justify traditional courses • Offer to many institutions simultaneously TWO LIMITS WHERE MOOC’S ARE COMPELLING 11/26/2014 Course Motivation 140
  • 141. • I proposed in 1999 (misjudging pace of online education): One can place faculty in their favorite location (e.g. remote cave) and universities provide structure to give “credentials” while faculty teach • Maybe some populate repository; others actually deliver courses • Note Google Course Builder or Microsoft Mix have no need for local resources except for faculty client (laptop) – entire course stored in the cloud • So we can radically change university system with a major cross institution virtual education and as well research component • Enabled by cloud plus high performance reliable networking • Lets set up community to define and build the virtual university MOOC repository and other needed activities HERMIT’S CAVE VIRTUAL UNIVERSITY 11/26/2014 Course Motivation 141
  • 142. • We should aim at simplicity; attractive at moment is a mix of multi-topic forum plus more interactive Hangouts or equivalent (5-12 people) MENTORING / GRADING https://www.youtube.com/watch?v=M3jcSCA9_hM 11/26/2014 Course Motivation 142
  • 143. 11/26/2014143 Course Motivation Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 144. • Use clouds in faculty, graduate student and undergraduate research • Clouds for Scientific data analysis • Core cloud technologies (MapReduce, Virtualization) • Teach clouds as it involves areas of Information Technology with lots of job opportunities • Use clouds to support distributed learning and research environment • A cloud backend for course materials and collaboration as in MOOC repository • Green environmentally friendly computing infrastructure MANY SYNERGIES CLOUDS AND UNIVERSITIES 11/26/2014 Course Motivation 144
  • 146. • Clouds are here to stay and one should plan on exploiting them • Data Intensive studies in business and research continue to grow in importance • Data Analytics: Everything is an optimization problem in a funny space • Growing employment opportunities in clouds and data related activities and so popular with students • Enabling many of the most important companies from Facebook/Google to General Electric • Need community discussion of data science education • Agree on curricula; is such a degree attractive? • MOOC’s interesting for • Disseminating new curricula • Managing course fragments that can be assembled into custom courses for particular interdisciplinary students CONCLUSIONS 11/26/2014 Course Motivation 146
  • 147. • Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics educated in data science • X = Astronomy, Biology, Biomedicine, Business, Chemistry, Climate, Crisis, Earth Science, Energy, Environment, Finance, Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology, Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and Wellness with more fields (physics) defined implicitly • Spans Industry and Science (research) BIG DATA ECOSYSTEM IN ONE SENTENCE 11/26/2014 Course Motivation 147