Big Data
Types of data and opportunities
Prof. Dr. Nikolaos (Nikos) Deligiannis
Email: ndeligia@vub.be
Twitter: @prof_ndeligia
Big Data: Big Challenges and Big Value
Big Data
Challenges
Volume
Veracity
Variety
Velocity
Value
Data Deluge
The Trend in the Job Market
Source: Indeed.com
Types of Data: Static vs Real-time
Part 1
Static Data
Medical Images
Road network information
Open Data
Static Data: Belgium OECD Data
Static Data
Road network
information
Actual data sample (GPX data).
Real-Time Data
Smart Mobility Smart Cities
Smart FinTech Smart Marketing
Real-Time Data
Positions of public transport vehicles
Real-Time Data
Public bicycle usage
Real-Time Data
[Link]
Real-time VR visualization of mobility and social
media data in Brussels.
[VR tool visualizing public transport flows in Brussels; the
system lets the user see, on the fly, the positions of STIB
buses, the occupancy of Villo stations, and geolocated social media
posts.]
Health Data Analytics
Data from epidemic web apps. [link]
Types of Data: Structured vs Unstructured
Part 2
Structured Data
Phone addresses
IBAN bank codes
Product descriptions
Unstructured Data
Images & Video Audio
Text files (reports...)
Unstructured Data: Video
Size of Video Data
My Camera Specs
• 8 MP (3264×2448 pixels) image
• 640×480 pixels video
• 24 bits per pixel
• 30 frames per second
Video from Milos → 5 min
- About 8.3 GB for storage (raw, uncompressed)
Internet connection → 1 Mbps
- About 18 hours for uploading
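The storage and upload figures can be checked with a few lines of arithmetic. The sketch below assumes raw, uncompressed frames and decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second); the function name is illustrative only.

```python
# Raw video size and upload time, using the camera figures from the slide.
WIDTH, HEIGHT = 640, 480        # video resolution in pixels
BITS_PER_PIXEL = 24
FRAMES_PER_SECOND = 30
DURATION_SECONDS = 5 * 60       # 5-minute clip
LINK_BITS_PER_SECOND = 1_000_000  # 1 Mbps connection


def raw_video_bits(width, height, bits_per_pixel, fps, seconds):
    """Total number of bits in an uncompressed video clip."""
    return width * height * bits_per_pixel * fps * seconds


total_bits = raw_video_bits(WIDTH, HEIGHT, BITS_PER_PIXEL,
                            FRAMES_PER_SECOND, DURATION_SECONDS)
size_gb = total_bits / 8 / 1e9                          # bits -> bytes -> GB
upload_hours = total_bits / LINK_BITS_PER_SECOND / 3600  # seconds -> hours

print(f"storage: {size_gb:.1f} GB, upload: {upload_hours:.1f} hours")
```

Real cameras compress video, so actual files are far smaller; the point of the slide is how quickly raw unstructured data grows.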
Types of Data: Open vs Public vs Private
Part 3
Private vs. Public
Extracting Value
Part 4
Regression: Predicting Second-Hand Car Prices
[Plot: price (euros) versus mileage (in 1000s of km)]
Supervised Learning
– Learn a model based on labeled training data
Regression
– The predicted parameter is continuous
Regression: Recommender Systems
Regression: Matrix Completion
Predict movie ratings
Netflix: users rate movies using a 0-5 star rating

Movie                Nikos (1)  Eva (2)  Duc (3)  Tien (4)
P.S. I love you      5          5        0        0
Lord of the rings    1          0        ?        4
Interstellar         ?          ?        5        5
Spectre              0          0        4        ?
Crazy, stupid love   5          4        0        ?
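One common way to fill in the "?" entries is low-rank matrix factorization: each movie and each user gets a short vector, and a rating is approximated by their dot product. The sketch below is a minimal illustration under stated assumptions, not the method used by Netflix or by the course. The ratings are illustrative values laid out like the slide's table, and the rank, learning rate, and regularization are arbitrary choices.

```python
import random

# Illustrative ratings (rows: movies, columns: users); None marks a missing rating.
ratings = [
    [5, 5, 0, 0],
    [1, 0, None, 4],
    [None, None, 5, 5],
    [0, 0, 4, None],
    [5, 4, 0, None],
]


def predict(p, q, i, j):
    """Predicted rating: dot product of movie vector i and user vector j."""
    return sum(a * b for a, b in zip(p[i], q[j]))


def factorize(r, rank=2, lr=0.01, reg=0.02, epochs=5000, seed=0):
    """Fit r ~ P Q^T on the observed entries with stochastic gradient descent."""
    rng = random.Random(seed)
    p = [[rng.random() for _ in range(rank)] for _ in range(len(r))]
    q = [[rng.random() for _ in range(rank)] for _ in range(len(r[0]))]
    observed = [(i, j) for i, row in enumerate(r)
                for j, value in enumerate(row) if value is not None]
    for _ in range(epochs):
        for i, j in observed:
            err = r[i][j] - predict(p, q, i, j)
            for k in range(rank):
                p_ik = p[i][k]  # keep the old value for the q update
                p[i][k] += lr * (err * q[j][k] - reg * p_ik)
                q[j][k] += lr * (err * p_ik - reg * q[j][k])
    return p, q


def rmse(p, q, r):
    """Root-mean-square error over the observed entries only."""
    errors = [(r[i][j] - predict(p, q, i, j)) ** 2
              for i, row in enumerate(r)
              for j, value in enumerate(row) if value is not None]
    return (sum(errors) / len(errors)) ** 0.5


movie_vectors, user_vectors = factorize(ratings)
training_error = rmse(movie_vectors, user_vectors, ratings)
completed = [[predict(movie_vectors, user_vectors, i, j) for j in range(4)]
             for i in range(5)]
```

The `completed` matrix holds a prediction for every cell, including the ones that were missing; with so few ratings these predictions are only indicative.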
Classification: Sepsis Mortality Probability
[Plot: outcome (Survived = 0, Died = 1) versus APACHE II score at baseline]
Supervised Learning
– Learn a model based on labeled training data
Classification
– The predicted parameter is discrete
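A standard model for this kind of 0/1 outcome is logistic regression, which maps a score to a probability between 0 and 1. The sketch below is only an illustration: the (score, outcome) pairs are hypothetical, not clinical data, and the learning rate and step count are arbitrary.

```python
import math

# Hypothetical (APACHE II score, outcome) pairs; 0 = survived, 1 = died.
scores = [5, 8, 10, 12, 15, 18, 20, 22, 25, 28, 30, 33, 35, 40]
died = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit a one-feature logistic regression with batch gradient descent."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    zs = [(x - mean) / std for x in xs]  # standardize for stable training
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [sigmoid(w * z + b) - y for z, y in zip(zs, ys)]
        w -= lr * sum(e * z for e, z in zip(errors, zs)) / len(zs)
        b -= lr * sum(errors) / len(zs)
    # Return a function that maps a raw score to a mortality probability.
    return lambda x: sigmoid(w * (x - mean) / std + b)


mortality_probability = fit_logistic(scores, died)
# The discrete class label comes from thresholding the probability.
predicted_label = 1 if mortality_probability(32) >= 0.5 else 0
```

The probability is continuous, but the predicted parameter (survived or died) is discrete, which is what makes this a classification problem.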
Clustering: The Pizza Hut Problem
Unsupervised Learning
– No labeled data available
Clustering
– Group the data
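Grouping unlabeled points can be sketched with k-means: assign each point to its nearest center, move each center to the mean of its points, and repeat. The customer coordinates and starting centers below are hypothetical, chosen only to illustrate placing two stores near two groups of customers.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and mean update."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customer locations (x, y) and initial store positions.
customers = [(1, 1), (1.5, 2), (2, 1), (1, 2),
             (8, 8), (9, 8.5), (8.5, 9), (9, 9)]
stores, groups = kmeans(customers, centroids=[(0, 0), (10, 10)])
```

No labels are used anywhere: the groups emerge from the distances between points alone.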
Dimensionality Reduction: Visualization
Unsupervised Learning
– No labeled data available
Dimensionality Reduction
– Find the latent dimensions of
the data
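One classical way to find a latent dimension is principal component analysis (PCA). The sketch below finds the first principal component with power iteration on the covariance matrix and projects 2D points onto it; the sample points are hypothetical and lie close to the line y = 2x.

```python
def first_principal_component(points, iterations=100):
    """Return (means, direction) of the top principal component via power iteration."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v


def project(points, means, direction):
    """Coordinates of each point along the given direction (1D representation)."""
    return [sum((p[j] - means[j]) * direction[j] for j in range(len(direction)))
            for p in points]


# Hypothetical 2D measurements that vary mostly along one direction.
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
means, direction = first_principal_component(samples)
coordinates_1d = project(samples, means, direction)
```

Each 2D point is replaced by a single number, its position along the dominant direction, which is the kind of reduction that makes high-dimensional data easier to visualize.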
Data Visualization
3D visualization map of the frequency of tweets in Brussels.
[Superposition of a high-resolution texture of the region, and a so-called height-map]
Topic Extraction on Social Media
Dominant media communities
on #Twitter in #Brussels
during June 2016 – January 2017
Visualization from 7 million tweets.
Twitter User Geolocation
Multiview deep learning architecture
S2 adaptive grid (Google S2 geometry library)
Geolocation accuracy
Significant gain in
geolocation accuracy
compared to the latest
approaches.
T. Do Huu, D. M. Nguyen, E. Tsiligianni, B. Cornelis, N. Deligiannis, “Twitter user geolocation
using deep multiview learning”, IEEE ICASSP 2018.
Image Analysis
True orthophoto Predicted Pixel label
Yu Liu, D. M. Nguyen et al. 2017
Cross-Modal Image-Text Retrieval
[Link]
Phrase localization in image
Example caption: A man with a goatee in a black shirt and
white latex gloves is using a tattoo on someone's back
Learning Problem Categories
Supervised Learning: learn a model based on labeled training data
– Regression: the predicted data is continuous
– Classification: the predicted data is discrete
Unsupervised Learning: no labeled training data
– Clustering: cluster the data into groups
– Dimensionality Reduction: find lower dimensions of the data
What is the Learning Problem Category?
Google news
What is the Learning Problem Category?
Optical Character Recognition
What is the Learning Problem Category?
Predict the Total Amount of Sales in Oklahoma (OK)
State # malls Sales (m. $)
WA 630 15.5
NC 370 7.5
CA 616 13.9
UT 700 18.7
FL 430 8.2
IL 568 13.2
TX 1200 23
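Because the quantity to predict (sales in million $) is continuous, a straight-line fit is a natural first model. The sketch below fits ordinary least squares to the table above; the slide does not give the number of malls in Oklahoma, so `OK_MALLS` is a purely hypothetical input.

```python
# (state, number of malls, sales in million $) as listed on the slide.
data = [
    ("WA", 630, 15.5), ("NC", 370, 7.5), ("CA", 616, 13.9), ("UT", 700, 18.7),
    ("FL", 430, 8.2), ("IL", 568, 13.2), ("TX", 1200, 23.0),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


malls = [m for _, m, _ in data]
sales = [s for _, _, s in data]
slope, intercept = fit_line(malls, sales)

OK_MALLS = 500  # hypothetical value, not given on the slide
predicted_ok_sales = slope * OK_MALLS + intercept
print(f"predicted sales for {OK_MALLS} malls: {predicted_ok_sales:.1f} million $")
```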
What is the Learning Problem Category?
Spam mail detector
Introduction to the Cloud
Part 5
Cloud Categories
Private cloud
(accessible only to company employees)
Public cloud
(service provided to any paying customer)
Amazon S3 (Simple storage service): store
arbitrary datasets, pay per GB-month stored
Amazon EC2 (Elastic Compute Cloud): upload
and run arbitrary OS images, pay per CPU hour
used
Google App Engine: develop applications
within the App Engine framework, upload data
that will be imported into their format, and run
Example of Cloud Architecture
Features in Today’s Cloud!
• Massive scale
• On-demand access
- Pay-as-you-go, no upfront commitment
- Anyone can access it
• Data-intensive applications
- MBs have become TBs, PBs and EBs
- Daily logs, forensics, web data, etc.
• New cloud programming paradigms
- MapReduce/Hadoop, NoSQL/Cassandra/MongoDB
- High in accessibility and ease of programmability
- Lots of open-source
Components of a Cloud
Servers (front) Servers (back)
Servers (inside) Servers (secure)
Powering a Cloud
Hydroelectric plants Thermoelectric plants
Photovoltaic plant
On-Demand Access
• On-demand access: like renting a car when needed
- AWS Elastic Compute Cloud (EC2): a few cents to a few USD
per CPU hour
- AWS Simple Storage Service (S3): a few cents to a few USD per
GB-month
• HaaS: Hardware as a Service
- You get access to hardware machines, do whatever you
want with them (example, your own cluster)
- Security risks
• IaaS: Infrastructure as a Service
- You get access to flexible computing and storage
infrastructure. Virtualization or, for example, a Linux
environment are ways to achieve this
- Examples: Amazon Web Services, Eucalyptus, Microsoft
Azure
On-Demand Access
• PaaS: Platform as a Service
- You get access to flexible computing and storage
infrastructure, together with a software platform.
- Example: Google AppEngine (Python, Java)
• SaaS: Software as a Service
- You get access to software services, when you need
them. Often subsumes Service Oriented Architectures
- Examples: Google Docs, MS Office on demand
Data-Intensive Applications
• Computation-intensive computing
- Example areas: MPI-based,
high performance computing, grids
- Typically run on supercomputers
- The speed of supercomputers is benchmarked in FLOPS (floating-point
operations per second) rather than in MIPS (million instructions per
second), the measure used for general-purpose computers
• Data-intensive computing
- Typically store data at datacenters
- Use compute nodes nearby
- Compute nodes run computation services
- The focus is on I/O operations (disk and/or network) not
on CPU utilization
New Cloud Programming Paradigms
Easy to write and run highly parallel programs in new cloud
programming paradigms:
• Google
- MapReduce and Sawzall
- MapReduce indexing: a chain of 24 MapReduce jobs
- Approx. 200K jobs processing 50PB/month (in 2006)
• Amazon
- Elastic MapReduce service (pay-as-you-go)
• Yahoo!
- Hadoop + Pig
- WebMap: a chain of 100 MapReduce jobs
- 280 TB of data, 2500 nodes
• Facebook
- Approx. 300TB total, adding 2TB/day (in 2008)
- 3K jobs processing 55TB/day
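The MapReduce idea behind these systems can be illustrated with the classic word count. The sketch below runs in a single Python process and only mimics the three stages (map, shuffle, reduce); it does not use the Hadoop or Elastic MapReduce APIs, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


documents = ["big data big value", "big challenges"]
counts = word_count(documents)
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data across the network; the programmer still only writes the map and reduce functions.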
Cloud Categories
Private cloud
(accessible only to company employees)
Public cloud
(service provided to any paying customer)
If you are starting your own company, should you use a
public cloud or purchase your own private cloud?
Power, cooling, management costs CPU usage, Storage usage
To Outsource Or Not
A medium-sized organization wishes to run a service for M
months
The service requires 128 servers (1024 cores) and 524 TB
of storage
• Outsource (e.g., AWS) [monthly cost]
- S3: $0.12 per GB-month; EC2: $0.10 per CPU-hour
- Storage: $0.12 × 524 × 1000 ≈ $62,000
- Computation: $0.10 × 1024 × 24 × 30 ≈ $74,000
- Total: approx. $136,000
• Purchase [total cost]
- Storage: approx. $349,000
- Total: approx. $1,555,000 + $7,500 per month for a
system administrator per 100 nodes
To Outsource Or Not
Breakeven analysis → duration of usage defines the
• Outsource (e.g., AWS) [monthly cost]
- Storage: $0.12 × 524 × 1000 ≈ $62,000 per month
- Total: approx. $136,000 per month
• Purchase [total cost]
- Storage: approx. $349,000
- Total: approx. $1,555,000 + ($7,500 per month)
Breakeven points
• Storage: $349,000 / $62,880 ≈ 5.55 months
• Total: $1,555,000 / ($136,000 − $7,500) ≈ 12 months
✓ Startups use clouds a lot – they do not know how long
they will be in business
✓ Cloud providers benefit monetarily more from storage
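The breakeven points follow from comparing a recurring monthly cost with a one-off purchase. The sketch below reproduces the arithmetic with the slide's prices; it assumes a 30-day month and a single system administrator at $7,500 per month, and the variable names are illustrative.

```python
# Prices and requirements taken from the slide.
S3_PER_GB_MONTH = 0.12
EC2_PER_CPU_HOUR = 0.10
STORAGE_GB = 524 * 1000          # 524 TB
CORES = 1024
HOURS_PER_MONTH = 24 * 30        # assumption: 30-day month

PURCHASE_STORAGE = 349_000       # one-off cost of buying the storage
PURCHASE_TOTAL = 1_555_000       # one-off cost of buying everything
ADMIN_PER_MONTH = 7_500          # assumption: one system administrator

storage_monthly = S3_PER_GB_MONTH * STORAGE_GB
compute_monthly = EC2_PER_CPU_HOUR * CORES * HOURS_PER_MONTH
cloud_monthly = storage_monthly + compute_monthly

# Owning pays off once the accumulated cloud bill exceeds the purchase cost.
breakeven_storage = PURCHASE_STORAGE / storage_monthly
# For the full system, owning still costs ADMIN_PER_MONTH every month.
breakeven_total = PURCHASE_TOTAL / (cloud_monthly - ADMIN_PER_MONTH)

print(f"cloud cost per month: ${cloud_monthly:,.0f}")
print(f"breakeven (storage): {breakeven_storage:.2f} months")
print(f"breakeven (total):   {breakeven_total:.1f} months")
```

Below the breakeven duration renting is cheaper; above it, purchasing wins, which is why the expected lifetime of the service drives the decision.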

Course 3 : Types of data and opportunities by Nikolaos Deligiannis

  • 1. Big Data Types of data and opportunities Prof. Dr. Nikolaos (Nikos) Deligiannis Email: ndeligia@vub.be Twitter: @prof_ndeligia
  • 2. 2 Big Data: Big Challenges and Big Value Big Data Challenges Volume Veracity VarietyVelocity Value
  • 4.
  • 5.
  • 6. The Trend in the Job Market Source: Indeed.com
  • 7. Types of Data: Static vs Real-time Part 1
  • 8. Static Data Medical Images Road network information Open Data
  • 11. Real-Time Data Smart Mobility Smart Cities Smart FinTech Smart Marketing
  • 12. Real-Time Data Positions of public transport vehicles
  • 14. Real-Time Data [Link] Real-time VR visualization of mobility and social media data in Brussels. [VR tool visualizing public transformation flows in Brussels; the system enables the user to see on-the-fly the position of STIB buses, the occupancy of Villo stations, geolocated social media posts.]
  • 15. 15/20 Health Data Analytics Data from epidemic web apps. [link]
  • 16. Types of Data: Structure vs Unstructured Part 2
  • 17. Structured Data Phone addresses IBAN bank codes Product descriptions
  • 18. Unstructured Data Images & Video Audio Text files (reports...)
  • 20. Size of Video Data My Camera Specs. § 8 MP (3264×2448 pixels) Image § 640×480 pixels Video § 24 bits per pixel § 30 frames per second Video from Milos à 5 min Ø 8.3 GBs for storage Internet Connection à 1 Mbps Ø 18 hours for uploading
  • 21. Type 3: Open vs Public vs Private data Part 3
  • 24. Regression: Predicting Second-Hand Car Prices Mileage (km) in 1000’s Price(euros) 3.000 6.000 9.000 12.000 15.000 20 40 60 80 100 120 14090 10.000 14.000 Supervised Learning – Learn a model based on labeled training data Regression – The predicted parameter is continuous
  • 26. Regression: Matrix Completion predict movie ratings Netflix: Users rate movies using a 0-5 star rating Nikos (1) Eva (2) Duc (3) Tien (4) P.S. I love you Lord of the rings Interstellar Spectre Crazy, stupid love 5 1 ? 0 5 5 0 ? 0 4 0 ? 5 4 0 0 4 5 ? ?
  • 27. Classification: Sepsis Mortality Probability APACHE II Score at Baseline Survived 0 Supervised Learning – Learn a model based on labeled training data Classification – The predicted parameter is discrete Died 1 5 10 15 20 25 30 35 40 45
  • 28. Clustering: The Pizza Hut Problem Unsupervised Learning – No labeled data available Clustering – Group the data
  • 29. Dimensionality Reduction: Visualization Unsupervised Learning – No labeled data available Dimensionality Reduction – Find the latent dimensions of the data
  • 30. Data Visualization 3D visualization map of the frequency of tweets in Brussels. [Superposition of a high-resolution texture of the region and a so-called height map]
  • 31. Topic Extraction on Social Media Dominant media communities on #Twitter in #Brussels during June 2016 – January 2017 Visualization from 7 million tweets.
  • 32. Twitter User Geolocation Multiview deep learning architecture S2 adaptive grid (Google S2 geometry library) Geolocation accuracy Significant gain in geolocation accuracy compared to the latest approaches. T. Do Huu, D. M. Nguyen, E. Tsiligianni, B. Cornelis, N. Deligiannis, “Twitter user geolocation using deep multiview learning”, IEEE ICASSP 2018.
  • 33. Image Analysis True orthophoto Predicted Pixel label Yu Liu, D. M. Nguyen et al. 2017
  • 35. Phrase localization in image Example caption: A man with a goatee in a black shirt and white latex gloves is using a tattoo on someone's back
  • 36. Learning Problem Categories • Supervised Learning: learn a model based on labeled training data - Regression: the predicted data is continuous - Classification: the predicted data is discrete • Unsupervised Learning: no labeled training data - Clustering: cluster the data into groups - Dimensionality Reduction: find lower dimensions of the data
  • 37. What is the Learning Problem Category? Google news
  • 38. What is the Learning Problem Category? Optical Character Recognition
  • 39. What is the Learning Problem Category? Predict the Total Amount of Sales in Oklahoma (OK) State. Training data as state: # malls, sales (m. $): WA: 630, 15.5; NC: 370, 7.5; CA: 616, 13.9; UT: 700, 18.7; FL: 430, 8.2; IL: 568, 13.2; TX: 1200, 23
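Since the target (sales) is continuous, this is a regression problem. A minimal sketch of a least-squares line fitted to the seven states on the slide follows. The number of malls in Oklahoma is not given in the transcript, so `ok_malls` is a hypothetical placeholder, and a straight line is only one possible model choice.

```python
import numpy as np

# Training data from the slide: number of malls and sales (million $) per state.
malls = np.array([630, 370, 616, 700, 430, 568, 1200], dtype=float)
sales = np.array([15.5, 7.5, 13.9, 18.7, 8.2, 13.2, 23.0])

# Least-squares fit of sales = slope * malls + intercept.
slope, intercept = np.polyfit(malls, sales, deg=1)


def predict_sales(n_malls):
    """Predicted sales (million $) for a state with n_malls malls."""
    return slope * n_malls + intercept


ok_malls = 500  # hypothetical value; the slide does not state Oklahoma's count
print(f"slope={slope:.4f}, intercept={intercept:.2f}")
print(f"predicted OK sales: {predict_sales(ok_malls):.1f} m. $")
```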
  • 40. What is the Learning Problem Category? Spam mail detector
  • 41. Introduction to the Cloud Part 5
  • 42. Cloud Categories Private cloud (accessible only to company employees) Public cloud (service provided to any paying customer) Amazon S3 (Simple Storage Service): store arbitrary datasets, pay per GB-month stored Amazon EC2 (Elastic Compute Cloud): upload and run arbitrary OS images, pay per CPU hour used Google App Engine: develop applications within the App Engine framework, upload data that will be imported into its format, and run them
  • 43. Example of Cloud Architecture
  • 44. Features in Today’s Cloud! • Massive scale • On-demand access - Pay-as-you-go, no upfront commitment - Anyone can access it • Data-intensive applications - MBs have become TBs, PBs and EBs - Daily logs, forensics, web data, etc. • New cloud programming paradigms - MapReduce/Hadoop, NoSQL/Cassandra/MongoDB - High in accessibility and ease of programmability - Lots of open-source
  • 45. Components of a Cloud Servers (front) Servers (back) Servers (inside) Servers (secure)
  • 46. Powering a Cloud Hydroelectric plants Thermoelectric plants Photovoltaic plant
  • 48. On-Demand Access • On-demand access: like renting a car when needed - AWS Elastic Compute Cloud (EC2): a few cents to a few USD per CPU hour - AWS Simple Storage Service (S3): a few cents to a few USD per GB-month • HaaS: Hardware as a Service - You get access to hardware machines and can do whatever you want with them (for example, run your own cluster) - Security risks • IaaS: Infrastructure as a Service - You get access to flexible computing and storage infrastructure. Virtualization or, for example, a Linux environment are ways to achieve this - Examples: Amazon Web Services, Eucalyptus, Microsoft Azure
  • 49. On-Demand Access • PaaS: Platform as a Service - You get access to flexible computing and storage infrastructure, together with a software platform - Example: Google App Engine (Python, Java) • SaaS: Software as a Service - You get access to software services when you need them. Often subsumes Service Oriented Architectures - Examples: Google Docs, MS Office on demand
  • 50. Data-Intensive Applications • Computation-intensive computing - Example areas: MPI-based, high performance computing, grids - Typically run on supercomputers - the speed of supercomputers is benchmarked in "FLOPS" (FLoating point Operations Per Second), and not in terms of "MIPS" (Million Instructions Per Second), as for general-purpose computers • Data-intensive computing - Typically store data at datacenters - Use compute nodes nearby - Compute nodes run computation services - The focus is on I/O operations (disk and/or network) not on CPU utilization
  • 51. New Cloud Programming Paradigms It is easy to write and run highly parallel programs in the new cloud programming paradigms: • Google - MapReduce and Sawzall - MapReduce indexing: a chain of 24 MapReduce jobs - Approx. 200K jobs processing 50 PB/month (in 2006) • Amazon - Elastic MapReduce service (pay-as-you-go) • Yahoo! - Hadoop + Pig - WebMap: a chain of 100 MapReduce jobs - 280 TB of data, 2500 nodes • Facebook - Approx. 300 TB total, adding 2 TB/day (in 2008) - 3K jobs processing 55 TB/day
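To make the MapReduce paradigm concrete, here is a single-process word-count sketch of its three stages (map, shuffle, reduce). It only illustrates the programming model; it is not the Hadoop or Google MapReduce API, and the function names are illustrative. In a real deployment the map and reduce calls run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["big data", "big value"]))
# prints {'big': 2, 'data': 1, 'value': 1}
```

Because each map call touches one document and each reduce call touches one key, both stages parallelize naturally, which is what makes chains of such jobs practical on thousands of nodes.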
  • 52. Cloud Categories Private cloud (accessible only to company employees) Public cloud (service provided to any paying customer) If you are starting your own company, should you use a public cloud or purchase your own private cloud? Factors: power, cooling, and management costs; CPU usage; storage usage
  • 53. To Outsource Or Not A medium-sized organization wishes to run a service for M months. The service requires 128 servers (1024 cores) and 524 TB of storage • Outsource (e.g., AWS) [monthly cost] - S3: $0.12 per GB-month - EC2: $0.10 per CPU-hour - Storage: $0.12 × 524 × 1000 ≈ $62,000 - Computation: $0.10 × 1024 × 24 × 30 ≈ $74,000 - Total: approx. $136,000 • Purchase [total cost] - Storage: approx. $349,000 - Total: approx. $1,555,000 + $7,500 per month for a system administrator per 100 nodes
  • 54. To Outsource Or Not Breakeven analysis → duration of usage defines the • Outsource (e.g., AWS) [monthly cost] - Storage: $0.12 × 524 × 1000 ≈ $62,000 per month - Total: approx. $136,000 per month • Purchase [total cost] - Storage: approx. $349,000 - Total: approx. $1,555,000 + ($7,500 per month) Breakeven points • Storage: $349,000/$62,880 ≈ 5.55 months • Total: $1,555,000/($136,608 − $7,500) ≈ 12 months (the purchase option keeps paying $7,500 per month for administration) ✓ Startups use clouds a lot – they do not know how long they will be in business ✓ Cloud providers benefit monetarily more from storage
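The breakeven points can be reproduced directly from the prices on the slides. The sketch below assumes a 30-day month and a single system administrator; the slide quotes $7,500 per month per 100 nodes, so a 128-server cluster might need more. All variable names are illustrative.

```python
# Requirements and prices taken from the slides.
STORAGE_GB = 524 * 1000       # 524 TB expressed in GB
CORES = 1024
S3_PER_GB_MONTH = 0.12        # USD per GB-month
EC2_PER_CPU_HOUR = 0.10       # USD per CPU-hour
HOURS_PER_MONTH = 24 * 30     # assumes a 30-day month

OWN_STORAGE_COST = 349_000    # one-off purchase cost of storage, USD
OWN_TOTAL_COST = 1_555_000    # one-off purchase cost of the whole system, USD
ADMIN_PER_MONTH = 7_500       # assumes one administrator for the cluster

# Monthly cost when outsourcing.
cloud_storage = S3_PER_GB_MONTH * STORAGE_GB
cloud_compute = EC2_PER_CPU_HOUR * CORES * HOURS_PER_MONTH
cloud_total = cloud_storage + cloud_compute

# Owning wins once cumulative cloud spend exceeds purchase plus running costs.
storage_breakeven = OWN_STORAGE_COST / cloud_storage
total_breakeven = OWN_TOTAL_COST / (cloud_total - ADMIN_PER_MONTH)

print(f"cloud: ${cloud_total:,.0f} per month")
print(f"storage breakeven: {storage_breakeven:.2f} months")
print(f"total breakeven: {total_breakeven:.1f} months")
```

If the service is expected to run for less than about a year, outsourcing is cheaper under these prices; beyond that, purchasing starts to pay off.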