HDInsight Essentials
ISBN: 1849695369 / ISBN 13: 9781849695367
Rajesh Nadipalli
05/01/2014
  
Goals of this Book
• Focus on Microsoft's new Hadoop distribution
• Serve as a quick reference
• Provide an overview of Hadoop
• Address both cloud and on-premise setup for HDInsight
• Highlight HDInsight differentiators
• Provide practical, real-world examples
  
Book Table of Contents
• Chapter 1: HDInsight in a Heartbeat
• Chapter 2: Deploying HDInsight on premise
• Chapter 3: HDInsight Azure cloud service
• Chapter 4: Administer your cluster
• Chapter 5: Ingest data into your cluster
• Chapter 6: Transform data in your cluster
• Chapter 7: Analyze & Report data from cluster
• Chapter 8: Project Planning & Architectural Considerations
  
CHAPTER 1 HIGHLIGHTS:
HDINSIGHT IN A HEARTBEAT
  
Big Data Problem Characteristics
  
Hadoop Overview
Self-Healing Distributed Storage
Fault-Tolerant Distributed Computing
+
Abstraction for Parallel Processing
CORE HADOOP COMPONENTS
• HDFS: Distributed storage; replicated, self-healing, and scalable
• MapReduce: Parallel processing; processes local data for efficiency
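The "replicated, self-healing" property can be illustrated with a toy model. This is only a sketch under stated assumptions, not HDFS code: the node names, the round-robin placement, and the `place_blocks` and `heal` helpers are all hypothetical. The one detail taken from HDFS is the default replication factor of 3.

```python
# Toy model of HDFS-style block replication; NOT real HDFS code.
# Node names and the placement policy are illustrative assumptions.
REPLICATION = 3  # HDFS's default replication factor


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = {nodes[(i + k) % len(nodes)] for k in range(replication)}
    return placement


def heal(placement, live_nodes, replication=REPLICATION):
    """Drop replicas held by dead nodes, then re-replicate onto live nodes."""
    healed = {}
    for block, holders in placement.items():
        survivors = holders & set(live_nodes)
        if not survivors:
            # Every replica is gone, so there is nothing left to copy from.
            healed[block] = set()
            continue
        # Candidates are live nodes that do not already hold this block.
        spare = [n for n in live_nodes if n not in survivors]
        needed = max(0, replication - len(survivors))
        healed[block] = survivors | set(spare[:needed])
    return healed


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(["blk_a", "blk_b"], nodes)
# Simulate the loss of one DataNode and restore the replica count.
after_failure = heal(placement, [n for n in nodes if n != "dn2"])
```

In this model every block is back to three replicas after `dn2` disappears, which is the behaviour the slide summarises as self-healing.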
  	
  
	
  
Hadoop Nodes Layout
[Diagram: a Master Node running NameNode, Secondary NameNode, and JobTracker; Slave Nodes each running a TaskTracker (MapReduce Layer) and a DataNode (Distributed File System Layer)]
  
Hadoop Ecosystem
• Data Sources: RDBMS databases; audio, images; log files; sensors, RFID; social media, feeds
• Flume, Sqoop
• Hadoop Data Store: HDFS, HBase (NoSQL DB)
• Data Processing: MapReduce
• Data Access: Hive, Pig, Mahout (Machine Learning)
• Excel, Business Data Feeds
• ZooKeeper (Distributed Process Management)
• HCatalog (Metadata on Pig, Hive, MapReduce)
• Oozie (Workflow, Scheduler)
• Infrastructure, Operations (Monitoring, Configuration)
  
End to End Solution on Hadoop
Collect & Import to HDFS → Process (MapReduce) → Analyze (BI Tools) → Report & Publish
  
Popular Hadoop Distributions
• Amazon Elastic MapReduce (cloud, http://aws.amazon.com/elasticmapreduce/)
• Cloudera (http://www.cloudera.com/content/cloudera/en/home.html)
• EMC Pivotal HD (http://gopivotal.com/)
• Hortonworks HDP (http://hortonworks.com/)
• MapR (http://mapr.com/)
• Microsoft HDInsight (cloud, http://www.windowsazure.com/)
  
HDInsight Differentiators
• Enterprise-ready Hadoop backed by Microsoft
• Analytics using Excel
• Integration with Active Directory
• Integration with .NET and JavaScript
• Connectors to RDBMS
• Scale using the cloud offering: the Azure HDInsight service enables customers to scale quickly and provides a seamless interface between HDFS and Azure Storage Vault
• JavaScript Console
  
WordCount in HDInsight
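The slide itself is a screenshot, so the logic behind WordCount is sketched here in plain Python. This is a minimal in-memory illustration of the map, shuffle, and reduce steps, not the job shown in the deck; the function names and the two sample lines are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


# Hypothetical sample input, for illustration only.
counts = word_count(["big data on Hadoop", "Hadoop on Azure"])
```

On a real cluster the mappers run in parallel on the nodes that hold the input blocks, and the framework performs the shuffle across the network; only the per-record logic resembles this sketch.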
  
CHAPTER 2 HIGHLIGHTS:
HDINSIGHT INSTALL ON PREMISE
  
HDInsight Distribution
Apache Hadoop
• Open Source Software
• Community Development
Hortonworks Data Platform
• Enterprise Hadoop Platform (HDP)
• Leaders in Hadoop
• Code committers to Hadoop
Microsoft HDInsight
• Built on top of HDP
• Integration with ASV, Excel, Power View, SQL Server, Active Directory
  
Physical Install Options
• Single node for development/test
• Multi node for production
[Diagram labels: NN (NameNode), SNN (Secondary NameNode), JT (JobTracker); DN (DataNode) / TT (TaskTracker)]
  
Multi Node Install Steps
• Pre-requisites
• Networking Setup
• Remote Scripting
• Firewall Setup
• Software Install (each node)
• Hadoop Configuration
• Verification
  
CHAPTER 3 HIGHLIGHTS:
HDINSIGHT AZURE SERVICE
  
Azure Cloud Service
• Create Storage
• Create HDInsight cluster
  
CHAPTER 4 HIGHLIGHTS:
ADMINISTER YOUR CLUSTER
  
HDInsight Cluster Management
HDInsight Dashboard
NameNode Status
JobTracker Status
  
CHAPTER 5 HIGHLIGHTS:
INGEST DATA INTO YOUR CLUSTER
  
Loading Data into your Cluster
You have the following options:
• Loading data using Hadoop commands
• Loading data using Azure Storage Vault
• Loading data using Interactive JavaScript
• Shipping data to your Cluster
• Loading data from RDBMS via Sqoop
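As a small illustration of the first option, the sketch below builds a `hadoop fs -put` command from Python. It assumes the Hadoop client is on the PATH of the machine that runs it; the helper name and both paths are hypothetical and do not come from the deck.

```python
import shlex


def hdfs_put_command(local_path, cluster_path):
    """Build the argument list for `hadoop fs -put <local> <destination>`."""
    return ["hadoop", "fs", "-put", local_path, cluster_path]


# Hypothetical paths, for illustration only.
cmd = hdfs_put_command("C:/data/sales.csv", "/user/demo/sales/")
print(shlex.join(cmd))

# To execute it on a machine with the Hadoop client installed:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than one string avoids shell quoting problems when a path contains spaces.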
  
Loading via Azure Storage Explorer
  
CHAPTER 6 HIGHLIGHTS:
TRANSFORM YOUR DATA
  
Transforming Data
You have the following options:
• MapReduce
• Hive
• Pig
• Others
  
Processing Data in Cluster
[Diagram: Map for Jan2012, Map for Feb2012, Map for Apr2013, … all feeding One Reducer]

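The layout in this diagram, one map task per monthly input with all output going to a single reducer, can be sketched as follows. The slide does not say what is computed, so the per-month total, the record layout, and the sample values are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical monthly inputs: one list of (record_id, amount) per month file.
monthly_inputs = {
    "Jan2012": [("r1", 10.0), ("r2", 5.0)],
    "Feb2012": [("r3", 7.5)],
    "Apr2013": [("r4", 2.5), ("r5", 4.0)],
}


def map_month(month, records):
    """One map task per month: emit (month, amount) for each record."""
    return [(month, amount) for _, amount in records]


def single_reducer(pairs):
    """The one reducer receives every mapper's output and totals it per month."""
    totals = defaultdict(float)
    for month, amount in pairs:
        totals[month] += amount
    return dict(totals)


# Each month is mapped independently; all pairs then go to the single reducer.
emitted = [
    pair
    for month, records in monthly_inputs.items()
    for pair in map_month(month, records)
]
totals = single_reducer(emitted)
```

A single reducer yields one consolidated output, at the cost of funnelling every mapper's output through one task.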
Hive
[Diagram: Hive components: JDBC/ODBC, Thrift Server, Command Line, Web GUI, Metastore, Driver (Parser, Planner, Executor); on top of MapReduce and HDFS]
  
Hive or Pig?
Raw Data in HDFS
• Distributed Storage
• Reliable
Data Processing via Pig
• Pipelines
• Iterative Processing
• Research
Data Warehouse via Hive
• BI Tools
• Analysis
[Diagram labels: Data Warehouse, HDFS]
  
CHAPTER 7 HIGHLIGHTS:
ANALYZE & REPORT
  
Analyze using Excel
  
CHAPTER 8:
PROJECT PLANNING & ARCHITECTURAL CONSIDERATIONS
  
• Executive & Stakeholder Buy-in
• Discovery & Analysis
• Design
• Implementation
• User Acceptance
• Production
• Operations
• Feedback, New Requirements
  
