2. • The SCQAA-SF chapter (www.scqaa.net) sponsors the
sharing of information to promote and encourage
improvement in information technology quality practices
and principles through networking, training and
professional development.
• Networking: We meet once every two months in the San
Fernando Valley.
• Check us out on LinkedIn (SCQAA-SF)
• Contact Sujit at sujit58@gmail.com or call 818-878-0834
About SCQAA-SF: A Not-for-Profit Organization
June 13, 2013 2
3. Membership Benefits:
• Excellent speaker presentations on
advancements in technology and
methodology
• Networking opportunities
• PDU, CSTE and CSQA credits
• Regular meetings are free for members
and include dinner
4. Membership Policy
• We recently revised our membership dues
policy to better accommodate member
needs and current economic conditions.
• Annual membership is $50, or $35 for
those who are between jobs.
• Please check your renewal status with Cheryl
Leoni. If you have recently joined or
renewed, please check before renewing
again.
6. Agenda
• Big Data and modern data management
• Old BI and New BI
• Hadoop Frameworks
• Big Data Quality – Hybrid Approach
• Big Data Processing - ETL
• Examples of Hadoop ETL/QA
• Big Data QA To-Do List
• Q&A
7. Big Data
• Today, useful data is 80% unstructured and
20% structured
• Old-style warehouses are not easy to build, and are
very expensive to build and maintain
• Today, the business need is for real-time,
actionable insight
• Big Data features volume, variety, velocity and
veracity
• Fact - Businesses need actionable intelligence to
succeed
9. Obama Election and Big Data
• “The Obama campaign found a way to integrate social media, technology, email
databases, fundraising databases and consumer market data,” said GOP digital
strategist Vincent Harris, who did digital work for Newt Gingrich and Rick Perry in
2012. “That does not exist on the Republican side to that degree,” to the
detriment of Mitt Romney’s campaign (quoted in Politico, “GOP seeks to up its
online game”, December 8, 2012). For more on how the Obama campaign used big
data, see BusinessWeek’s November 29, 2012 article “The Science Behind Those
Obama Campaign Emails”.
10. BI = ‘Current State’ Questions
• What did we sell?
• When did we sell it?
• Where did we sell it?
• What did we sell with it?
Collecting transactional data
11. Big Data = ‘Next State’ BI Questions
• What could happen?
• Why didn’t this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What should happen?
Collecting behavioral, temporal data
12. Comparing old and new BI data
Attribute    Old BI data               New BI data
Data Size    Gigabytes (Terabytes)     Petabytes (Exabytes)
Access       Interactive and Batch     Batch
Updates      Read / Write many times   Write once, Read many times
Structure    Static Schema             Dynamic Schema
Integrity    High (ACID)               Low
Scaling      Nonlinear                 Linear
DBA Ratio    1:40                      1:3000
Reference: Tom White’s Hadoop: The Definitive Guide
23. Big Data QA Process
• Hybrid approach - can use traditional Perl-like
scripting, tools, and JUnit tests on the destination side
• Use Hadoop jobs to refine and do ETL for
unstructured data on the source side
• Improve the upstream QA process to do most of the
ETL/QA at the source
• Leverage Hadoop infrastructure to do mining
• Fact - The Big Data QA window is getting smaller
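As a rough illustration of the destination-side scripting mentioned above, here is a minimal Python sketch of a reconciliation check between a source extract and what landed at the destination. Python stands in for the Perl-like scripts and JUnit tests named on the slide; the function names, field names and sample rows are all hypothetical.

```python
import hashlib

def row_fingerprint(row):
    """Return a stable hash of a record, independent of key order."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def reconcile(source_rows, dest_rows, required_fields):
    """Compare source and destination extracts and summarize QA findings."""
    src = {row_fingerprint(r) for r in source_rows}
    dst = {row_fingerprint(r) for r in dest_rows}
    # A destination row is incomplete if any required field is missing or empty.
    incomplete = [
        r for r in dest_rows
        if any(r.get(field) in (None, "") for field in required_fields)
    ]
    return {
        "source_count": len(source_rows),
        "dest_count": len(dest_rows),
        "missing_in_dest": len(src - dst),
        "unexpected_in_dest": len(dst - src),
        "incomplete_rows": len(incomplete),
    }

# Hypothetical sample data: one row was lost and one arrived incomplete.
source = [{"id": 1, "city": "Burbank"}, {"id": 2, "city": "Encino"}]
dest = [{"id": 1, "city": "Burbank"}, {"id": 3, "city": ""}]
report = reconcile(source, dest, required_fields=["id", "city"])
print(report)
```

Matching row counts alone would pass this sample, which is why the sketch also compares row fingerprints and checks required fields.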
24. Microsoft SSIS - Hadoop ETL
• Use an ODBC driver to extract data from any
Hadoop HDFS
• Use HDInsight (Microsoft Hadoop) as the data
store
• Use SSIS for ETL
• Source lookups from Melissa Data and others
• Load to SQL Server
Reference URL: http://sqlmag.com/blog/use-ssis-etl-hadoop
25. Amazon EMR - Hadoop ETL
• Design and code a job on Amazon AWS using
EMR (Elastic MapReduce)
• Source lookups from Melissa Data and others
• Run the job to do ETL
• Read from and write to S3 buckets
• Use open source Pig Latin and Java UDFs for ETL
Reference URL: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-etl.html
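To make the shape of such an ETL job concrete, here is a small local sketch of the map and reduce phases in plain Python. It is not EMR, Pig Latin or a Java UDF, and it calls no AWS APIs. The three-field record layout (state, city, amount) and all sample lines are assumptions made up for illustration.

```python
from collections import defaultdict

def map_record(line):
    """Map step: parse one raw CSV line into (key, value); None if malformed."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        return None
    state, _city, amount = parts
    try:
        # Normalize the key so "ca" and "CA" aggregate together.
        return state.upper(), float(amount)
    except ValueError:
        return None

def reduce_totals(pairs):
    """Reduce step: sum the values for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical raw input, including two dirty lines that should be rejected.
raw_lines = [
    "ca, Burbank, 10.5",
    "CA, Encino, 4.5",
    "nv, Reno, 3.0",
    "bad line without enough fields",
    "ca, Van Nuys, not-a-number",
]
mapped = [map_record(line) for line in raw_lines]
rejected = mapped.count(None)
totals = reduce_totals(pair for pair in mapped if pair is not None)
print(totals, "rejected:", rejected)
```

Counting rejected records alongside the totals gives QA a simple number to track from run to run.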
29. BI > Big Data QA ‘To Do’ List
Get trained and store some (more) data in the cloud
• Relational and non-relational
Process some data in the cloud
• Do ETL and QA
• Try data mining
• Learn about Data Science
Update your client tools
• New UI (touch, gestures)
• Click to Query
• New form factors (phone, tablet)
30. Keep Up With Big Data QA
• Learn Big Data now (NRIT is a bootcamp training
provider): learn to write ETL/QA jobs and to query HDFS using
ODBC
• Assume source data is not clean; do upstream ETL and QA using
lookups and reference data sets
• Fact - Hadoop is now used by most Fortune 500
companies for fast analytics and insights
• Fact - Investment in Hadoop ultimately depends on BI/analytics
(example: the Obama election)
• Fact - QA matters; garbage in, garbage out is still true!
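The advice to assume dirty source data and check it against reference data sets could look something like the following Python sketch. The reference table, field names and sample records are hypothetical stand-ins for a real lookup source such as a purchased address data set.

```python
# Hypothetical reference data set: ZIP code -> expected city name.
REFERENCE_ZIPS = {"91403": "Sherman Oaks", "91436": "Encino"}

def check_record(record, reference):
    """Return a list of QA issues found by looking the record up in reference data."""
    issues = []
    zip_code = str(record.get("zip", "")).strip()
    city = str(record.get("city", "")).strip()
    if zip_code not in reference:
        issues.append("unknown zip")
    elif reference[zip_code].lower() != city.lower():
        issues.append("city does not match zip")
    return issues

def split_clean_and_rejected(records, reference):
    """Route records upstream: clean ones go on to ETL, the rest go to review."""
    clean, rejected = [], []
    for record in records:
        issues = check_record(record, reference)
        if issues:
            rejected.append((record, issues))
        else:
            clean.append(record)
    return clean, rejected

# Hypothetical incoming records: one clean, one mismatched, one unknown.
incoming = [
    {"zip": "91403", "city": "sherman oaks"},
    {"zip": "91436", "city": "Burbank"},
    {"zip": "00000", "city": "Nowhere"},
]
clean, rejected = split_clean_and_rejected(incoming, REFERENCE_ZIPS)
print(len(clean), "clean;", len(rejected), "rejected")
```

Keeping the reason next to each rejected record lets the upstream team fix the source rather than patching data at the destination.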
Presentation: BI/Big Data Futures - Is it really all about the Cloud?
In this survey session, SKS will bring you up-to-date on what's happening in the world of enterprise Business Intelligence. BigData, NoSQL, Hadoop, Big Analytics, Cloud Storage, what does all of this mean to you as a data professional? Which products and technologies are mature enough for enterprise adoption and which ones are not? Which vendors should you be trying out and why? What is the reality of hosting enterprise data on the cloud? What are the business reasons to explore these new technologies? How do you learn to implement them?
SKS frames this talk with the three major trends that she sees in the Enterprise BI space, highlighting products and technologies that warrant a deeper look.
From the blog - http://www.thisisthegreenroom.com/2011/data-science-vs-business-intelligence/