Big Data Analytics Projects - Real World with Pentaho
1. Big Data Analytics Projects
in the Real World
Mark Kromer
Pentaho Big Data Analytics Product Manager
@mssqldude
@kromerbigdata
http://www.kromerbigdata.com
2. 1. The Big Data Technology Landscape
2. Big Data Analytics
3. Big Data Analytics Scenarios:
✯ Digital Marketing Analytics
• Hadoop, Aster Data, SQL Server
✯ Sentiment Analysis
• MongoDB, SQL Server
✯ Data Refinery
• Hadoop, MPP, SQL Server, Pentaho
4. SQL Server in the Big Data world (Quasi-Real World)
What we'll (try to) cover today
3. Big Data 101
The 3 V's
✯ Volume – Terabyte records, transactions, tables, files
✯ Velocity – Batch, near-time, real-time (analytics), streams
✯ Variety – Structured, unstructured, semi-structured, and all of the above in a mix
Text Processing
✯ Techniques for processing and analyzing unstructured (and structured) LARGE files
Analytics & Insights
Distributed File System & Programming
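The "Text Processing" bullet above is about streaming through large files rather than loading them whole. A minimal Python sketch (not from the deck, purely illustrative) of that pattern – counting word frequencies line by line so memory use stays flat regardless of file size:

```python
from collections import Counter

def word_counts(lines):
    """Stream over lines (a file handle or any iterable) and count tokens.

    Processes one line at a time, so a multi-gigabyte file never has to
    fit in memory -- the core idea behind batch text processing at scale.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# Works the same on a small list or on an open file handle:
sample = ["big data big files", "data data"]
print(word_counts(sample)["data"])  # 3
```

The same function accepts `open("huge.log")` directly, since a file object iterates line by line.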
4. Big Data ≠ NoSQL
✯ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing
✯ Facebook, for example, uses HBase from the Hadoop stack
✯ NoSQL does not have to be Big Data
Big Data ≠ Real Time
✯ Big Data is primarily about batch processing huge files in a distributed manner, analyzing data that was otherwise too complex to provide value
✯ Use in-memory analytics for real-time insights
Big Data ≠ Data Warehouse
✯ I still refer to large multi-TB DWs as "VLDB"
✯ Big Data is about crunching stats in text files to discover new patterns and insights
✯ Use the DW to aggregate and store the summaries of those calculations for reporting
Mark's Big Data Myths
5. • Batch processing
• Commodity hardware
• Data locality, no shared storage
• Scales linearly
• Great for large text file processing, not so great on small files
• Distributed programming paradigm
Hadoop 1.x
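The "distributed programming paradigm" on this slide is MapReduce: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase folds each group. A toy Python sketch of those three phases (purely illustrative – this is not Hadoop's API, just the control flow it distributes across nodes):

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record; mappers emit (key, value) pairs."""
    for rec in records:
        yield from mapper(rec)

def shuffle(pairs):
    """Group all emitted values by key -- Hadoop does this across the cluster."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Fold each key's values down to one result."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: map emits (word, 1); reduce sums the 1s.
mapper = lambda line: ((word, 1) for word in line.split())
reducer = lambda key, values: sum(values)

result = reduce_phase(shuffle(map_phase(["to be or not to be"], mapper)), reducer)
print(result["to"])  # 2
```

The same mapper/reducer pair maps directly onto the C# `MapperBase` / `ReducerCombinerBase` classes shown on slides 19-20.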
11. Apache Spark
High-Speed In-Memory Analytics over Hadoop
– Open source
– Alternative to MapReduce for certain applications
– A low-latency cluster computing system
– For very large data sets
– May be 100 times faster than MapReduce for:
– Iterative algorithms
– Interactive data mining
– Used with Hadoop / HDFS
– Released under the BSD License
16. Sentiment Analysis
Reference Architecture 2
[Diagram: Social Media Sources; Data Orchestration; Big Data Platforms (Hadoop, PDW, MongoDB); Data Models; Analytical Models (OLAP Cubes, Data Mining); OLAP Analytics Tools, Reporting Tools, Dashboards]
18. • Distributed Data (Data Locality)
✯ HDFS / MapReduce
✯ YARN / Tez
✯ Replicated / sharded data
• MPP Databases
✯ Vertica, Aster, PDW, Greenplum … in-database analytics that can scale out with distributed processing across nodes
• Distributed Analytics
✯ SAS: "Quickly solve complex problems using big data and sophisticated analytics in a distributed, in-memory and parallel environment."
http://www.sas.com/resources/whitepaper/wp_46345.pdf
• In-memory Analytics
✯ Microsoft PowerPivot (Tabular models)
✯ SAP HANA
✯ Tableau
Big Data Analytics
Core Tenets
19. using Microsoft.Hadoop.MapReduce;
using System.Text.RegularExpressions;

public class TotalHitsForPageMap : MapperBase
{
    // Values below are assumptions for the W3C log sample; adjust to your log layout.
    private const int expected = 10;   // expected number of fields per record
    private const int pagePos = 4;     // index of the page-URL field
    private const string hit = "1";    // each record counts as one hit

    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        var parts = Regex.Split(inputLine, @"\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        context.EmitKeyValue(parts[pagePos], hit);
    }
}
MapReduce Framework (Map)
20. using System.Linq; // needed for Sum()

public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true;
        return retVal;
    }
}
MapReduce Framework (Reduce & Job)
21. Shell commands to access data in HDFS
Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv
List files in HDFS:
c:\Hadoop>hadoop fs -ls /import
Found 1 items
-rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv
View file in HDFS:
c:\Hadoop>hadoop fs -cat /import/sales.csv
Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
Now, we can work on the data with MapReduce, Hive, Pig, etc.
Get Data into Hadoop
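Before handing the file to Hive or MapReduce, it can help to see what the downstream aggregation actually computes. A small Python sketch (not part of the deck) over the exact sales.csv rows shown above, summing sales per product – the column names follow the Hive schema on the next slide (lastname, productid, quantity, sales_amount):

```python
import csv
import io
from collections import defaultdict

# The sample rows from the slide: lastname, productid, quantity, sales_amount
data = """Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
"""

# Sum sales_amount per productid -- what a GROUP BY in Hive would do.
totals = defaultdict(float)
for lastname, productid, quantity, amount in csv.reader(io.StringIO(data)):
    totals[productid] += float(amount)

print(totals["123"])  # 165.0  (55 + 99 + 11)
```

In Hive this is simply `SELECT productid, SUM(sales_amount) FROM ext_sales GROUP BY productid;`.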
22. create external table ext_sales
(
lastname string,
productid int,
quantity int,
sales_amount float
)
row format delimited fields terminated by ',' stored as textfile location
'/user/makromer/hiveext/input';
LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
Use Hive for Data Schema and Analysis
23. sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1
> hadoop fs -cat /user/mark/customers/part-m-00000
> 5,Bob Smith
sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
Sqoop
Data transfer to & from Hadoop & SQL Server
24. Role of NoSQL in a Big Data Analytics Solution
➣ Use NoSQL to store data quickly without the overhead of an RDBMS
➣ HBase, plain old HDFS, Cassandra, MongoDB, Dynamo, just to name a few
➣ Why NoSQL?
➣ In the world of "Big Data"
➣ "Schema later"
➣ Ignore ACID properties
➣ Drop data into a key-value store quick & dirty
➣ Worry about query & read later
➣ Why NOT NoSQL?
➣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, or MR interface
➣ SQL Server and NoSQL
➣ Not a natural fit
➣ Use HDFS or your favorite NoSQL database
➣ Consider turning off SQL Server locking mechanisms
➣ Focus on writes, not reads (read uncommitted)
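The "schema later" / "quick & dirty" write path above can be sketched in a few lines of Python (illustrative only, not any particular NoSQL product's API): append raw JSON records with no validation, and let each record carry whatever shape it arrived with; the schema question is deferred to read time.

```python
import json
import os
import tempfile

def append_event(path, event):
    """'Schema later': append a raw record as one JSON line, no validation."""
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

def read_events(path):
    """Read time is where you finally decide what fields you care about."""
    with open(path) as f:
        return [json.loads(line) for line in f]

path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
append_event(path, {"user": "kromer", "clicks": 5})
append_event(path, {"user": "smith", "page": "/home"})  # different shape is fine
print(len(read_events(path)))  # 2
```

The trade-off the slide names shows up immediately: writes are trivial, but any query now has to cope with heterogeneous records.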
25. MongoDB and the Enterprise IT Stack
[Diagram: Applications (CRM, ERP, Collaboration, Mobile, BI) · Data Management: Online Data (MongoDB, RDBMS) and Offline Data (EDW, Hadoop, RDBMS) · Infrastructure (OS & Virtualization, Compute, Storage, Network) · cross-cutting: Management & Monitoring, Security & Auditing]
27. Text Search Example
(e.g. address typo, so do a fuzzy match)
// Text search for address filtered by first name and NY
> db.ticks.runCommand( "text",
    { search: "vanderbilt ave. vander bilt",
      filter: { name: "Smith",
                city: "New York" } })
28. // Find total value of each customer's accounts for a given RM (or Agent), sorted by value
db.accts.aggregate(
    { $match: { relationshipManager: "Smith" } },
    { $group:
        { _id: "$ssn",
          totalValue: { $sum: "$value" } } },
    { $sort: { totalValue: -1 } } )
Aggregate: Total Value of Accounts
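To make the pipeline's semantics concrete, here is the same $match / $group / $sort logic in plain Python (illustrative only; the sample account documents are hypothetical, not from the deck):

```python
from collections import defaultdict

# Hypothetical account documents: ssn, relationshipManager, value
accts = [
    {"ssn": "111", "relationshipManager": "Smith", "value": 100},
    {"ssn": "111", "relationshipManager": "Smith", "value": 50},
    {"ssn": "222", "relationshipManager": "Smith", "value": 75},
    {"ssn": "333", "relationshipManager": "Jones", "value": 999},
]

# $match: keep only Smith's accounts
matched = [a for a in accts if a["relationshipManager"] == "Smith"]

# $group: _id = ssn, totalValue = sum of value
totals = defaultdict(int)
for a in matched:
    totals[a["ssn"]] += a["value"]

# $sort: totalValue descending
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])  # ('111', 150)
```

Each list comprehension / loop / sort corresponds one-to-one with a pipeline stage, which is a useful way to read any aggregation pipeline.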
29. SQL Server Big Data – Data Loading
Amazon HDFS & EMR Data Loading
Amazon S3 Bucket
30. SQL Server Database
✯ SQL 2012 Enterprise Edition
✯ Page compression
✯ 2012 columnar compression on fact tables
✯ Clustered index on all tables
✯ Auto-update stats, async
✯ Partition fact tables by month and archive data with a sliding-window technique
✯ Drop all indexes before nightly ETL load jobs
✯ Rebuild all indexes when ETL completes
SQL Server Analysis Services
✯ SSAS 2012 Enterprise Edition
✯ 2008 R2 OLAP cubes partition-aligned with the DW
✯ 2012 cubes: in-memory tabular cubes
✯ All access through MSMDPUMP or SharePoint
SQL Server Big Data Environment
32. [Diagram: Pentaho Business Analytics – Enterprise & Interactive Reporting, Interactive Analysis, Dashboards, Predictive Analytics – over Data Integration (Instaview | Visual Map Reduce), with direct access to Operational Data, Big Data, Data Stream, and Public/Private Clouds; users: DBA, ETL/BI Developer, Business Users & Executives, Analysts & Data Scientists]
Pentaho Big Data Analytics
33. Pentaho Big Data Analytics
Accelerate the time to big data value
• Full continuity from data access to decisions – complete data integration & analytics for any big data store
• Faster development, faster runtime – visual development, distributed execution
• Instant and interactive analysis – no coding and no ETL required
34. Product Components
Pentaho Data Integration
• Visual development for big data
• Broad connectivity
• Data quality & enrichment
• Integrated scheduling
• Security integration
Pentaho Analyzer
• Visual data exploration
• Ad hoc analysis
• Interactive charts & visualizations
Pentaho Dashboards
• Self-service dashboard builder
• Content linking & drill-through
• Highly customized mash-ups
Pentaho Data Mining & Predictive Analytics
• Model construction & evaluation
• Learning schemes
• Integration with 3rd-party models using PMML
Pentaho Enterprise & Interactive Reports
• Both ad hoc & distributed reporting
• Drag & drop interactive reporting
• Pixel-perfect enterprise reports
Pentaho for Big Data MapReduce & Instaview
• Visual interface for developing MR
• Self-service big data discovery
• Big data access for data analysts
35. ✯ Simple, easy-to-use visual data exploration
✯ Web-based thin client; in-memory caching
✯ Rich library of interactive visualizations
• Geo-mapping, heat grids, scatter plots, bubble charts, line-over-bar and more
• Pluggable visualizations
✯ Java ROLAP engine to analyze structured and unstructured data, with SQL dialects for querying data from RDBMSs
✯ Pluggable cache integrating with leading caching architectures: Infinispan (JBoss Data Grid) & Memcached
Pentaho Interactive Analysis & Data Discovery
Highly Flexible Advanced Visualizations