SlideShare uma empresa Scribd logo
1 de 25
Baixar para ler offline
Big Data for Oracle Professionals
Arup Nanda
Big Data Explorer
Time
Growth
Tweet @ArupNanda
Hadoop
Map/Reduce
YARN
NoSQL
Spark
Flume.
Tweet @ArupNanda
fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)" fcrawler.looksmart.com
fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html
HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400] "GET /news/news.html
HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12 -0400] "GET
/download/windows/asctab31.zip HTTP/1.0" 200 1540096
"http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html"
"Mozilla/4.7 [en]C-SYMPA (Win95; U)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0"
200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200
8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF"
"Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - -
Tweet @ArupNanda
fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)" fcrawler.looksmart.com
petabytes
unpredictable format
transient.
Tweet @ArupNanda
Metadata Repository
Tweet @ArupNanda
Tweet @ArupNanda
Tweet @ArupNanda
olumeV
arietyV
elocityV
Tweet @ArupNanda
CUSTOMERS
CUST_ID
NAME
ADDRESS
Tweet @ArupNanda
CUSTOMERS
CUST_ID
NAME
ADDRESS
SPOUSE
Tweet @ArupNanda
CUSTOMERS
CUST_ID
NAME
ADDRESS
SPOUSES
CUST_ID
NAME
CURRENT
Tweet @ArupNanda
CUSTOMERS
CUST_ID
NAME
ADDRESS
SPOUSES
CUST_ID
NAME
CURRENT
EMPLOYERS
CUST_ID
NAME
CURRENT
Tweet @ArupNanda
Name = Data
Relationship status = Data
Married to = Data
In a relationship with = Data
Friends = Data, Data, Data
Likes = Data, Data
Mutually Exclusive, Maybe not?
Multiple Data Points
Tweet @ArupNanda
First Name John
Spouse Jane
Child Jill
Goes to Acme School
Tweet @ArupNanda
First Name Martha
Child goes to Acme School
Tweet @ArupNanda
First Name John
Spouse Jane
Child Jill
Goes to Acme School
First Name Martha
Child goes to Acme School
Tweet @ArupNanda
First Name Martha
Child goes to Acme School
Teacher Mrs Gillen
Teacher Mrs Gillen
Jill
Tweet @ArupNanda
First Name John
Spouse Jane
Child Jill
Goes to Acme School
Teacher Mr Fullmeister
Tweet @ArupNanda
First Name Irene
Boyfriend Henry
Works at Starwood
Hobby Photography
Ex-Spouse Jane
Tweet @ArupNanda
Tweet @ArupNanda
First Name Irene
Key Value
Key-Value Pair
Tweet @ArupNanda
John Smith and his wife Jane,
along with their daughter Jill,
were strolling on the beach
when they heard a crash. John
ran towards …
Tweet @ArupNanda
Map
Tweet @ArupNanda
begin
get post
while (there_are_remaining_posts) loop
extract status of "like" for the specific post
if status = "like" then
like_count := like_count + 1
else
no_comment := no_comment + 1
end if
end loop
end
Counter()
Tweet @ArupNanda
Counter() Counter() Counter()
Tweet @ArupNanda
Counter() Counter() Counter()
Likes=100
No Comments=
300
Likes=50
No Comments=
350
Likes=150
No Comments=
250
Likes=300
No Comments=
900
Reduce
Tweet @ArupNanda
Map Reduce/
Dividing the
work among
different
nodes
Collating the
results to get
final answer
Tweet @ArupNanda
Counter
()
Counter
()
Counter
()
Likes=100
No
Comments=
300
Likes=50
No
Comments=
350
Likes=150
No
Comments=
250Likes=300
No
Comments=
900
• Divide the workload
• Submit and track the jobs
• If a job fails, restart it
on another node
• …
Hadoop
Tweet @ArupNanda
Resource Management
Applications
YARN
Yet Another Resource Negotiator
Map Reduce v2.
Tweet @ArupNanda
Counter() Counter() Counter()
Filesystem Filesystem Filesystem
1 2 32 3 13 1 2
Hadoop Distributed Filesystem (HDFS)
Tweet @ArupNanda
Count
er()
Count
er()
Count
er()
Filesystem Filesystem Filesystem
1 2 32 3 13 1 2
• Not shared storage
• Data is discrete
• Version control not required
• Concurrency not required
• Transactional integrity across
nodes not required.
Comparison
with RAC
Tweet @ArupNanda
Advantages of Hadoop
• Processors need not be super-fast
• Immensely scalable
• Storage is redundant by design
• No RAID level required.
Count
er()
Count
er()
Count
er()
Filesystem Filesystem Filesystem
1 2 32 3 13 1 2
Tweet @ArupNanda
Scalable?
ACID Properties
Reliability at a cost
Large overhead in data processing
Tweet @ArupNanda
Website logs
Combine with structured data
SOAP Messages
Twitter, Facebook …
Tweet @ArupNanda
Data Access: through programs
NoSQL Databases
Tweet @ArupNanda
Key Value
Key Value DB
Key Document
Document DB.
Key Value
Key Value
Key Value
{
empID:1,
empName:Larry
salary:infinity
}
Tweet @ArupNanda
SQL-interface required
Hive
HiveQL
Tweet @ArupNanda
Creating a Hive Table
create table accounts (
accno int,
accname string,
balance float
)
row format delimited
fields terminated by ‘,’
stored as texfile
location '/user/hive/db1.db/accounts'
Tweet @ArupNanda
select count(*)
from store_sales ss
join household_demographics hd on (ss.ss_hdemo_sk
= hd.hd_demo_sk)
join time_dim t on (ss.ss_sold_time_sk = t.t_time_sk)
join store s on (s.s_store_sk = ss.ss_store_sk)
where
t.t_hour = 8
t.t_minute >= 30
hd.hd_dep_count = 2
order by cnt;
HiveQL
Tweet @ArupNanda
Map/Reduce
Divide the work and
collate the results
Needs development
in Java, Python, Ruby, etc.
A framework to work on
the dataset in parallel Pig
Pig Latin
Scripting language for
Pig
Tweet @ArupNanda
select category, avg(pagerank)
from urls
where pagerank > 0.2
group by category
having count(*) > 1000000
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY
COUNT(good_urls)>1000000;
output = FOREACH big_groups GENERATE category,
AVG(good_urls.pagerank);
SQL
Pig Latin
Tweet @ArupNanda
HBase
HiveQL
Pig
A database built
on Hadoop
An SQL-like (but
not the same)
query language
Procedural Logic
without M/R Code.
Tweet @ArupNanda
normal programming
languages, e.g. Python
YARN
Map/Reduce code in Java
Spark
Tweet @ArupNanda
Count
er()
Count
er()
Count
er()
Filesystem Filesystem Filesystem
1 2 32 3 13 1 2
Hadoop processing in
files
Memory is cheaper
Interactive
processing needs
faster access.
Tweet @ArupNanda
Spark
Core
SparkShell SparkSQL MLib SparkR PySpark
Can use Java, Python or Scala
Tweet @ArupNanda
Divide and conquer is the key
Non-shared division of data is important
Local access
Redundancy
Hadoop is a framework
You have to write the programs
Big data is batch-oriented
Hive is SQL-like
Pig Latin is a 4GL-like scripting language
Spark uses memory
Tweet @ArupNanda
Oh, I so want to Learn!
Cloudera – prebuilt VMs
https://www.cloudera.com/documentation/ente
rprise/5-9-x/topics/cloudera_quickstart_vm.html
Hortonworks – prebuilt
VMs
https://hortonworks.com/downloads/#sandbox
Tweet @ArupNanda
Thanks!
arup.blogspot.com @ArupNanda
Tweet @ArupNanda

Mais conteúdo relacionado

Semelhante a Big Data and Hadoop Fundamentals for Oracle Professionals

Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Amazon Athena (April 2017)
Amazon Athena (April 2017)Amazon Athena (April 2017)
Amazon Athena (April 2017)Julien SIMON
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure TechnologiesKoray Kocabas
 
SDPHP - Percona Toolkit (It's Basically Magic)
SDPHP - Percona Toolkit (It's Basically Magic)SDPHP - Percona Toolkit (It's Basically Magic)
SDPHP - Percona Toolkit (It's Basically Magic)Robert Swisher
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Samir Bessalah
 
An Architect's guide to real time big data systems
An Architect's guide to real time big data systemsAn Architect's guide to real time big data systems
An Architect's guide to real time big data systemsRaja SP
 
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Holden Karau
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Holden Karau
 
Structured streaming for machine learning
Structured streaming for machine learningStructured streaming for machine learning
Structured streaming for machine learningSeth Hendrickson
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Improving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInImproving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInDatabricks
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.pptAbhijitManna19
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.pptsnowflakebatch
 
strata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streamingstrata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streamingShidrokhGoudarzi1
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeWim Godden
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptxDori Waldman
 

Semelhante a Big Data and Hadoop Fundamentals for Oracle Professionals (20)

Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Amazon Athena (April 2017)
Amazon Athena (April 2017)Amazon Athena (April 2017)
Amazon Athena (April 2017)
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure Technologies
 
SDPHP - Percona Toolkit (It's Basically Magic)
SDPHP - Percona Toolkit (It's Basically Magic)SDPHP - Percona Toolkit (It's Basically Magic)
SDPHP - Percona Toolkit (It's Basically Magic)
 
Velocity 2015-final
Velocity 2015-finalVelocity 2015-final
Velocity 2015-final
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
An Architect's guide to real time big data systems
An Architect's guide to real time big data systemsAn Architect's guide to real time big data systems
An Architect's guide to real time big data systems
 
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Structured streaming for machine learning
Structured streaming for machine learningStructured streaming for machine learning
Structured streaming for machine learning
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Improving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInImproving Spark SQL at LinkedIn
Improving Spark SQL at LinkedIn
 
Epic South Disasters
Epic South DisastersEpic South Disasters
Epic South Disasters
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
 
strata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streamingstrata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streaming
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the code
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
 

Mais de Gerger

Source Control for the Oracle Database
Source Control for the Oracle DatabaseSource Control for the Oracle Database
Source Control for the Oracle DatabaseGerger
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 
Best Way to Write SQL in Java
Best Way to Write SQL in JavaBest Way to Write SQL in Java
Best Way to Write SQL in JavaGerger
 
Version control for PL/SQL
Version control for PL/SQLVersion control for PL/SQL
Version control for PL/SQLGerger
 
Gitora, Version Control for PL/SQL
Gitora, Version Control for PL/SQLGitora, Version Control for PL/SQL
Gitora, Version Control for PL/SQLGerger
 
Gitora, Version Control for PL/SQL
Gitora, Version Control for PL/SQLGitora, Version Control for PL/SQL
Gitora, Version Control for PL/SQLGerger
 
PostgreSQL for Oracle Developers and DBA's
PostgreSQL for Oracle Developers and DBA'sPostgreSQL for Oracle Developers and DBA's
PostgreSQL for Oracle Developers and DBA'sGerger
 
Shaping Optimizer's Search Space
Shaping Optimizer's Search SpaceShaping Optimizer's Search Space
Shaping Optimizer's Search SpaceGerger
 
Gitora, Version Control for PL/SQL
Gitora, Version Control for PL/SQLGitora, Version Control for PL/SQL
Gitora, Version Control for PL/SQLGerger
 
Monitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with ZabbixMonitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with ZabbixGerger
 
Introducing ProHuddle
Introducing ProHuddleIntroducing ProHuddle
Introducing ProHuddleGerger
 
Use Cases of Row Pattern Matching in Oracle 12c
Use Cases of Row Pattern Matching in Oracle 12cUse Cases of Row Pattern Matching in Oracle 12c
Use Cases of Row Pattern Matching in Oracle 12cGerger
 
Introducing Gitora,the version control tool for PL/SQL
Introducing Gitora,the version control tool for PL/SQLIntroducing Gitora,the version control tool for PL/SQL
Introducing Gitora,the version control tool for PL/SQLGerger
 

Mais de Gerger (13)

Source Control for the Oracle Database
Source Control for the Oracle DatabaseSource Control for the Oracle Database
Source Control for the Oracle Database
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Best Way to Write SQL in Java
Best Way to Write SQL in JavaBest Way to Write SQL in Java
Best Way to Write SQL in Java
 
Version control for PL/SQL
Version control for PL/SQLVersion control for PL/SQL
Version control for PL/SQL
 
Gitora, Version Control for PL/SQL
Gitora, Version Control for PL/SQLGitora, Version Control for PL/SQL
Gitora, Version Control for PL/SQL
 
Gitora, Version Control for PL/SQL
Gitora, Version Control for PL/SQLGitora, Version Control for PL/SQL
Gitora, Version Control for PL/SQL
 
PostgreSQL for Oracle Developers and DBA's
PostgreSQL for Oracle Developers and DBA'sPostgreSQL for Oracle Developers and DBA's
PostgreSQL for Oracle Developers and DBA's
 
Shaping Optimizer's Search Space
Shaping Optimizer's Search SpaceShaping Optimizer's Search Space
Shaping Optimizer's Search Space
 
Gitora, Version Control for PL/SQL
Gitora, Version Control for PL/SQLGitora, Version Control for PL/SQL
Gitora, Version Control for PL/SQL
 
Monitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with ZabbixMonitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with Zabbix
 
Introducing ProHuddle
Introducing ProHuddleIntroducing ProHuddle
Introducing ProHuddle
 
Use Cases of Row Pattern Matching in Oracle 12c
Use Cases of Row Pattern Matching in Oracle 12cUse Cases of Row Pattern Matching in Oracle 12c
Use Cases of Row Pattern Matching in Oracle 12c
 
Introducing Gitora,the version control tool for PL/SQL
Introducing Gitora,the version control tool for PL/SQLIntroducing Gitora,the version control tool for PL/SQL
Introducing Gitora,the version control tool for PL/SQL
 

Último

Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 

Último (20)

Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 

Big Data and Hadoop Fundamentals for Oracle Professionals

  • 1. Big Data for Oracle Professionals Arup Nanda Big Data Explorer Time Growth Tweet @ArupNanda
  • 2. Hadoop Map/Reduce YARN NoSQL Spark Flume. Tweet @ArupNanda fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)" fcrawler.looksmart.com fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)" fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400] "GET /news/news.html HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)" ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12 -0400] "GET /download/windows/asctab31.zip HTTP/1.0" 200 1540096 "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" "Mozilla/4.7 [en]C-SYMPA (Win95; U)" 123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)" 123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)" 123.123.123.123 - - Tweet @ArupNanda
  • 3. fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)" fcrawler.looksmart.com petabytes unpredictable format transient. Tweet @ArupNanda Metadata Repository Tweet @ArupNanda
  • 7. CUSTOMERS CUST_ID NAME ADDRESS SPOUSES CUST_ID NAME CURRENT EMPLOYERS CUST_ID NAME CURRENT Tweet @ArupNanda Name = Data Relationship status = Data Married to = Data In a relationship with = Data Friends = Data, Data, Data Likes = Data, Data Mutually Exclusive, Maybe not? Multiple Data Points Tweet @ArupNanda
  • 8. First Name John Spouse Jane Child Jill Goes to Acme School Tweet @ArupNanda First Name Martha Child goes to Acme School Tweet @ArupNanda
  • 9. First Name John Spouse Jane Child Jill Goes to Acme School First Name Martha Child goes to Acme School Tweet @ArupNanda First Name Martha Child goes to Acme School Teacher Mrs Gillen Teacher Mrs Gillen Jill Tweet @ArupNanda
  • 10. First Name John Spouse Jane Child Jill Goes to Acme School Teacher Mr Fullmeister Tweet @ArupNanda First Name Irene Boyfriend Henry Works at Starwood Hobby Photography Ex-Spouse Jane Tweet @ArupNanda
  • 11. Tweet @ArupNanda First Name Irene Key Value Key-Value Pair Tweet @ArupNanda
  • 12. John Smith and his wife Jane, along with their daughter Jill, were strolling on the beach when they heard a crash. John ran towards … Tweet @ArupNanda Map Tweet @ArupNanda
  • 13. begin get post while (there_are_remaining_posts) loop extract status of "like" for the specific post if status = "like" then like_count := like_count + 1 else no_comment := no_comment + 1 end if end loop end Counter() Tweet @ArupNanda Counter() Counter() Counter() Tweet @ArupNanda
  • 14. Counter() Counter() Counter() Likes=100 No Comments= 300 Likes=50 No Comments= 350 Likes=150 No Comments= 250 Likes=300 No Comments= 900 Reduce Tweet @ArupNanda Map Reduce/ Dividing the work among different nodes Collating the results to get final answer Tweet @ArupNanda
  • 15. Counter () Counter () Counter () Likes=100 No Comments= 300 Likes=50 No Comments= 350 Likes=150 No Comments= 250Likes=300 No Comments= 900 • Divide the workload • Submit and track the jobs • If a job fails, restart it on another node • … Hadoop Tweet @ArupNanda Resource Management Applications YARN Yet Another Resource Negotiator Map Reduce v2. Tweet @ArupNanda
  • 16. Counter() Counter() Counter() Filesystem Filesystem Filesystem 1 2 32 3 13 1 2 Hadoop Distributed Filesystem (HDFS) Tweet @ArupNanda Count er() Count er() Count er() Filesystem Filesystem Filesystem 1 2 32 3 13 1 2 • Not shared storage • Data is discrete • Version control not required • Concurrency not required • Transactional integrity across nodes not required. Comparison with RAC Tweet @ArupNanda
  • 17. Advantages of Hadoop • Processors need not be super-fast • Immensely scalable • Storage is redundant by design • No RAID level required. Count er() Count er() Count er() Filesystem Filesystem Filesystem 1 2 32 3 13 1 2 Tweet @ArupNanda Scalable? ACID Properties Reliability at a cost Large overhead in data processing Tweet @ArupNanda
  • 18. Website logs Combine with structured data SOAP Messages Twitter, Facebook … Tweet @ArupNanda Data Access: through programs NoSQL Databases Tweet @ArupNanda
  • 19. Key Value Key Value DB Key Document Document DB. Key Value Key Value Key Value { empID:1, empName:Larry salary:infinity } Tweet @ArupNanda SQL-interface required Hive HiveQL Tweet @ArupNanda
  • 20. Creating a Hive Table create table accounts ( accno int, accname string, balance float ) row format delimited fields terminated by ‘,’ stored as texfile location '/user/hive/db1.db/accounts' Tweet @ArupNanda select count(*) from store_sales ss join household_demographics hd on (ss.ss_hdemo_sk = hd.hd_demo_sk) join time_dim t on (ss.ss_sold_time_sk = t.t_time_sk) join store s on (s.s_store_sk = ss.ss_store_sk) where t.t_hour = 8 t.t_minute >= 30 hd.hd_dep_count = 2 order by cnt; HiveQL Tweet @ArupNanda
  • 21. Map/Reduce Divide the work and collate the results Needs development in Java, Python, Ruby, etc. A framework to work on the dataset in parallel Pig Pig Latin Scripting language for Pig Tweet @ArupNanda select category, avg(pagerank) from urls where pagerank > 0.2 group by category having count(*) > 1000000 good_urls = FILTER urls BY pagerank > 0.2; groups = GROUP good_urls BY category; big_groups = FILTER groups BY COUNT(good_urls)>1000000; output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank); SQL Pig Latin Tweet @ArupNanda
  • 22. HBase HiveQL Pig A database built on Hadoop An SQL-like (but not the same) query language Procedural Logic without M/R Code. Tweet @ArupNanda normal programming languages, e.g. Python YARN Map/Reduce code in Java Spark Tweet @ArupNanda
  • 23. Count er() Count er() Count er() Filesystem Filesystem Filesystem 1 2 32 3 13 1 2 Hadoop processing in files Memory is cheaper Interactive processing needs faster access. Tweet @ArupNanda Spark Core SparkShell SparkSQL MLib SparkR PySpark Can use Java, Python or Scala Tweet @ArupNanda
  • 24. Divide and conquer is the key Non-shared division of data is important Local access Redundancy Hadoop is a framework You have to write the programs Big data is batch-oriented Hive is SQL-like Pig Latin is a 4GL-like scripting language Spark uses memory Tweet @ArupNanda Oh, I so want to Learn! Cloudera – prebuilt VMs https://www.cloudera.com/documentation/ente rprise/5-9-x/topics/cloudera_quickstart_vm.html Hortonworks – prebuilt VMs https://hortonworks.com/downloads/#sandbox Tweet @ArupNanda