2. Data Expertise / Lynn Langit
• Industry awards
- Microsoft - MVP for SQL Server
- Google - GDE for Cloud Platform
- 10Gen - Master for MongoDB
• Practicing Architect
• Technical author / trainer
- Pluralsight - Google Cloud Series
- DevelopMentor - SQL Server 2012 Series
- 2 books on SQL Server BI
- Cloudera trainer (certified)
• Former MSFT FTE
- 4 years
5. Is Big Data = NoSQL and just Hadoop?
HUGE Hype factor since 2011
Apache Hadoop
• a software framework that supports data-intensive distributed applications
• released under a free license; enables applications to work with thousands of nodes and petabytes of data
• was inspired by Google's MapReduce and Google File System (GFS) papers
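To make the MapReduce idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. It shows the three steps a MapReduce framework spreads across many nodes: map each input to key/value pairs, group the pairs by key, then reduce each group.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on an in-memory list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "hadoop handles big data"]
    print(word_count(docs))
```

On a real cluster the map and reduce calls would run in parallel on different nodes, with the input read from a distributed file system; this sketch only illustrates the programming model.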
7. How you "get" Hadoop
Open source
• roll your own
Commercial distribution
• Cloudera
• MapR
• Hortonworks
• More…
Rent it via the cloud
• AWS
• HDInsight
12. Example Comparison: RDBMS vs. Hadoop
                    | Traditional RDBMS       | Hadoop / MapReduce
Data Size           | Gigabytes (Terabytes)   | Petabytes and greater
Access              | Interactive and Batch   | Batch - NOT Interactive
Updates             | Read / Write many times | Write once, Read many times
Structure           | Static Schema           | Dynamic Schema
Integrity           | High (ACID)             | Low
Scaling             | Nonlinear               | Linear
Query Response Time | Can be near immediate   | Has latency (due to batch processing)
13. "Small" BigData vs. "Big" BigData
On Premises | In the Cloud
Hadoop      | Hadoop
NoSQL       | NoSQL
RDBMS       | RDBMS
14. But wait…
is there a relational database
• that scales
• that is cheap
• that runs in the cloud?
15. DEMO - AWS Redshift
• About $1k per Terabyte per year - relational
29. Graph Databases
• a lot of many-to-many relationships
• recursive self-joins
• when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data
• Examples:
- Neo4j
- Google Freebase
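The "finding connections" use case can be sketched in a few lines of plain Python. This is not the API of Neo4j or any other graph database; the `FOLLOWS` data and the `shortest_path` helper are hypothetical, made up for illustration. The point is that a question such as "how is A connected to B?" becomes a simple traversal over relationships, where a relational design would need repeated self-joins on a link table.

```python
from collections import deque

# Hypothetical "follows" relationships; the names are illustrative only.
FOLLOWS = {
    "ann": ["bob", "cara"],
    "bob": ["dan"],
    "cara": ["dan", "eve"],
    "dan": ["eve"],
    "eve": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of nodes from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


if __name__ == "__main__":
    print(shortest_path(FOLLOWS, "ann", "eve"))
```

A graph database stores the relationships directly, so traversals like this do not require one join per hop.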
36. Cloud Offerings - RDBMS AND NoSQL
                               | AWS                                | Google                              | Microsoft
RDBMS                          | RDS - all major                    | MySQL                               | SQL Azure
NoSQL buckets                  | S3 or Glacier                      | Cloud Storage                       | Azure Blobs
NoSQL Key-Value                | DynamoDB                           | Cloud Datastore                     | Azure Tables
Streaming or ML                | Custom EC2 (Mahout)                | Prospective Search & Prediction API | StreamInsight
NoSQL Document or Graph        | MongoDB on EC2                     | Freebase                            | MongoDB on Windows Azure
NoSQL - Column, Hadoop (HBase) | Elastic MapReduce using S3 & EC2   | none                                | HDInsight
Dremel/Warehousing             | RedShift                           | BigQuery                            | none
39. Can Excel help?
• Connector to Hadoop
• Data Explorer
• Data Quality Services
• Master Data Services
• Integration with Azure Data Market
• Visualize with PowerView
• Data Mining w/Predixion
41. Other types of cloud data services
Hosting public datasets
• Pay to read
• Earn revenue by offering for read
Cleaning / matching (your) data
• ETL - Microsoft Data Explorer, Google Refine
• Data Quality - Windows Azure Data Market, InfoChimps, DataMarket.com
42. Collecting for "BigData"
• Sensors everywhere
• Structured, Semi-structured, Unstructured vs. Data Standards
• M2M
• Public Datasets
- Freebase
- Azure DataMarket
- Hilary Mason's list
43. NoSQL To-Do List
Understand types of NoSQL databases
• Use NoSQL when business needs dictate it
• Use the right type of NoSQL for your business problem
Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mashup cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments
Learn NoSQL access technologies & services
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) - Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc.
• Windows Azure Data Market, other public data markets
45. Keep Learning
• Twitter: @LynnLangit
• YouTube: http://www.youtube.com/user/SoCalDevGal
• Hire me
- To help build your BI/Big Data solution
- To teach your team next gen BI
- To learn more about using NoSQL solutions
Hadoop on Azure - http://msdn.microsoft.com/en-us/magazine/jj190805.aspx
http://www.oracle.com/technetwork/bdc/hadoop-loader/overview/index.html
http://www.microsoft.com/download/en/details.aspx?id=27584
http://hortonworks.com/technology/hortonworksdataplatform/
More about HBase, from the O'Reilly "Getting Ready for BigData" report:
"Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google's BigTable, the project's goal is to host billions of rows of data for rapid access. MapReduce can use HBase as both a source and a destination for its computations, and Hive and Pig can be used in combination with HBase. In order to grant random access to the data, HBase does impose a few restrictions: performance with Hive is 4-5 times slower than plain HDFS, and the maximum amount of data you can store is approximately a petabyte, versus HDFS' limit of over 30PB."
http://www.cloudera.com/
http://nosql-database.org/
http://hadoop.apache.org/ & http://www.mongodb.org/
Wikipedia - http://en.wikipedia.org/wiki/NoSQL
List of NoSQL databases - http://nosql-database.org/
The good, the bad - http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
http://code.google.com
Access via REST APIs
Very cheap, but not much functionality included
Lots of code to write for application development
But… can be a good backup solution
http://en.wikipedia.org/wiki/MongoDB & http://try.mongodb.org/
http://www.mongodb.org/downloads
http://www.mongodb.org/display/DOCS/Drivers
http://www.infinitegraph.com/what-is-a-graph-database.html and http://www.neo4j.org/
http://en.wikipedia.org/wiki/Graph_database
http://www.freebase.com/
http://www.neo4j.org/learn/try
For Google - http://code.google.com
For AWS - https://console.aws.amazon.com/console/home
Hadoop on AWS - http://wiki.apache.org/hadoop/AmazonEC2
http://www.microsoft.com/en-us/bi/default.aspx
http://dennyglee.com/
Demos - http://www.youtube.com/watch?v=djfpPsGwm6A and http://www.youtube.com/watch?v=uh9bKWO1K7U
DataMarkets - InfoChimps, Factual, DataMarket, Windows Azure Data Marketplace, Wolfram Alpha, Datasift
http://www.microsoft.com/en-us/sqlazurelabs/default.aspx and http://www.microsoft.com/en-us/sqlazurelabs/labs/dataexplorer.aspx
https://datamarket.azure.com/
http://www.freebase.com/
http://code.google.com/p/google-refine/