Every day we create roughly 2.5 quintillion bytes of data; 90% of the world's collected data has been generated in the last 2 years alone. In this slide deck, learn all about big data
in a simple, easy way.
2. What is Big Data?
• Huge Amount of Data (Terabytes or Petabytes)
• Big data is the term for a collection of data sets
so large and complex that it becomes difficult to
process using on-hand database management
tools or traditional data processing applications
• The challenges include capture, curation, storage,
search, sharing, transfer, analysis, and
visualization
3. What is Big Data? (Cont’d)
• Every day we create roughly 2.5 quintillion bytes of data;
90% of the world's collected data has been generated in the
last 2 years alone
• Data sizes are now measured in terabytes, petabytes, exabytes &
zettabytes
• Data is of various types – user content (posts, tweets),
images, audio and video files, chat logs, machine-generated
data
• The speed of data generation has increased – we generate
2,500,000,000,000,000,000 (2.5 quintillion) bytes of data
EVERY DAY
• The data sources have multiplied many times over – social
media, banking, government, e-commerce etc.
4. Purpose of Big Data or Why Big Data?
• In 2015, we created more data than in all past years combined. But 90% of the data we create is never
sorted.
• Big Data gives us a better and different picture of the data collected, but collecting it is not enough; we
also need to find better ways of using it.
• And for all this you need the best data scientists you can possibly get your hands on.
• Companies love data because it provides very lucrative insights to businesses and their clients – data today
is the most important commodity. If companies can harness and analyse data, it provides an unmatched
competitive advantage
• New-age tools like Hadoop have made data handling and processing easier and cheaper
• Cloud computing, hardware and memory are getting cheaper, so data storage is not a problem
• New analytical software provides real-time analysis of data for business decisions
5. Big Data Market Trend
• Market research firm IDC forecasts a 50% increase in revenues from the sale of big data and
business analytics software, hardware, and services between 2015 and 2019.
• It says services will account for the biggest chunk of revenue, with the banking and
manufacturing industries poised to spend the most.
• By 2019, IDC said it expects revenue generated by the US market for big data and business
analytics solutions to exceed $98 billion.
• Edge analytics is the next big thing in Big Data technology for companies that want to
gain real-time insights and impact business through IoT use cases. It will allow companies to
derive the most actionable value from their data.
• The key trend in Big Data is the transition of analytics solutions into the cloud. The cloud
enables vast amounts of computing resources to be applied to data analysis, and lets that
computing scale based on need.
6. Big Data Learning Challenges
• Upgrades: Every 3 to 6 months a new version of a technology is introduced, so by the time you have actually
learned something, a new technology may have emerged, making your knowledge outdated.
• Data Extraction: The challenge is to extract the most important information out of the massive data.
• Lack of Talent: A successful Big Data project requires a sophisticated team of developers, data
scientists and analysts with sufficient knowledge of Big Data, but such skills are in very short supply.
• Data Quality: Dirty data costs the United States an estimated $600 billion every year. The common causes of
dirty data are user input errors, data duplication & incorrect data linking.
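The three causes of dirty data listed above can be caught with simple validation checks. A minimal sketch, using hypothetical records and field names, that flags a malformed user input and a duplicate:

```python
# Hypothetical records illustrating two common dirty-data causes:
# user input errors and data duplication.
records = [
    {"id": 1, "email": "alice@example.com"},
    {"id": 2, "email": "bob@example,com"},    # input error: comma instead of dot
    {"id": 3, "email": "alice@example.com"},  # duplicate of record 1
]

def find_issues(rows):
    """Return (id, reason) pairs for rows that fail basic quality checks."""
    issues = []
    seen = set()
    for row in rows:
        email = row["email"]
        if "," in email or "@" not in email:
            issues.append((row["id"], "malformed email"))
        elif email in seen:
            issues.append((row["id"], "duplicate email"))
        else:
            seen.add(email)
    return issues

print(find_issues(records))  # [(2, 'malformed email'), (3, 'duplicate email')]
```

Real data-quality pipelines apply many more rules, but the shape is the same: validate each record, track what has been seen, and report everything that fails.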
7. Metrics of Data Size
Till date, we've mostly been familiar with data in GB. The world is changing now:
Data Metrics Hierarchy
1024 BYTES = 1 KB
1024 KILOBYTES = 1 MB
1024 MEGABYTES = 1 GB
1024 GIGABYTES = 1 TB
1024 TERABYTES = 1 PB
1024 PETABYTES = 1 EB
1024 EXABYTES = 1 ZB
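The hierarchy above can be walked programmatically. A small sketch (the function name is our own) that converts a raw byte count into the largest sensible unit by repeatedly dividing by 1024:

```python
def human_readable(num_bytes):
    """Convert a raw byte count into the largest whole binary unit."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]
    value = float(num_bytes)
    for unit in units:
        # Stop once the value fits under 1024, or we run out of units.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024

print(human_readable(1024))          # 1.0 KB
print(human_readable(2.5 * 10**18))  # ~2.2 EB -- the data created every day
```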
8. Why Has Data Generation Increased?
The advent of the internet is the primary reason for the data explosion that has taken place over the last 15 years.
Every minute:
• Facebook - more than 2.5 million pieces of content
• Twitter - 300,000 tweets
• Instagram - 250,000 new pictures
• YouTube - 75 hours of new video content
• Email - 400 million messages
• WhatsApp - 400,000 pictures
• Google - 5 million search requests
9. Other sources of Data generation
• E-commerce – More than 1.3 billion transactions per day
• Data-logging devices – Wearables (e.g. Fitbit), healthcare monitoring, GPS systems, data sensors
• Financial Data – Stock exchanges, Banking transactions
• Aviation Industry – A typical flight generates half a TB of data
• Governments – Citizen data, Tax records etc.
10. Companies Want More Data !!
• Companies love data because it provides very lucrative insights to businesses and their clients – data today is
the most important commodity. If companies can harness and analyze data, it provides an unmatched
competitive advantage
• New-age tools like Hadoop have made data handling and processing easier and cheaper
• Cloud computing, hardware and memory are getting cheaper, so data storage is not a problem
• New analytical software provides real-time analysis of data for business decisions
11. Characteristics of Big Data
• Big Data is characterized by 4 V’s: Volume, Velocity, Variety and Veracity
12. Characteristics of Big Data – Volume
Volume: Refers to the enormous volumes of data
13. Characteristics of Big Data – Velocity
Velocity: It refers to the pace at which data is being generated, processed and consumed
14. Characteristics of Big Data – Variety
Variety: Data can be gathered from infinite sources
15. Characteristics of Big Data – Veracity
Veracity: The quality and authenticity of the data being captured can vary greatly
16. Scope: Testing Aspects in Big Data
• Validation of Structured and Unstructured Data: Data needs to be classified as the structured
and unstructured parts.
(i) Structured Data: Data that can be stored in the form of tables (rows and columns) without any
processing, for example databases, call-detail records and Excel sheets.
(ii) Unstructured Data: Data that does not have a predefined data model or structure, for example data
in the form of weblogs, audio, tweets, and comments.
Adequate time needs to be spent on validating the data at an early stage, since this is the point
where we encounter an abundance of bad data from various sources.
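The structured/unstructured split above can be sketched as a simple classifier. This is an illustrative sketch only; the schema and function names are hypothetical, using the call-detail-record example from the text:

```python
# Hypothetical schema for a structured call-detail record.
EXPECTED_FIELDS = {"caller", "callee", "duration_sec"}

def classify(record):
    """Treat dicts matching the expected schema as structured; everything
    else (free text such as weblog lines, tweets, comments) as unstructured."""
    if isinstance(record, dict) and EXPECTED_FIELDS <= record.keys():
        return "structured"
    return "unstructured"

print(classify({"caller": "A", "callee": "B", "duration_sec": 42}))  # structured
print(classify("GET /index.html 200 [12/Mar/2016]"))                 # unstructured
```

In a real pipeline each branch then gets its own validation: schema and type checks for the structured part, parsing or text processing for the unstructured part.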
• Execution of Non-Functional Testing: Non-functional testing plays a vital role in ensuring the
scalability of the process. Functional testing focuses on the coding and requirement related issues
whereas non-functional testing classifies the performance bottlenecks and validates the non-
functional requirements.
• Handling Non-Relational Databases: Non-relational databases form the backbone of Big Data
storage. Since these are the main sources of data retrieval, they require a good portion of testing to maintain
the accuracy of the system. Commonly known as NoSQL databases, these DBs are designed in such a manner
that they can easily handle Big Data, and they differ from traditional RDBMSs, which are designed on the
table/key model.
17. • Ace Test Environment: An efficient test environment ensures that data from multiple sources is
of acceptable quality for accurate analysis. Replicating the complete set of big data
into the test environment is next to impossible, so a small subset of the data is created for
the test environment to verify the behavior. Careful planning is required to exercise all paths
with subsets of data in a manner that fully verifies the application.
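Building that small subset is typically a sampling problem. A minimal sketch (function name and fraction are our own choices) that draws a reproducible random sample so the test environment sees the same data on every run:

```python
import random

def sample_for_test_env(dataset, fraction=0.01, seed=42):
    """Draw a small, reproducible subset of a large dataset for the test
    environment, since copying the full data set is impractical."""
    rng = random.Random(seed)  # fixed seed keeps test runs repeatable
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(dataset, k)

full_data = list(range(1_000_000))  # stand-in for production data
subset = sample_for_test_env(full_data)
print(len(subset))  # 10000
```

Simple random sampling will not by itself "exercise all paths"; in practice teams augment such a sample with hand-picked edge cases for each path under test.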
18. Phases of testing in Big Data and Hadoop
Testing of Big Data and Hadoop is an enormous and complex process, which is segregated into
four phases to squeeze the best results out of testing. These phases are as follows:
1. Pre-Hadoop Processing: It includes the validation of the data which is collated from various sources
before Hadoop processing. This is the phase where we get rid of unwanted data.
2. Processing of MapReduce Job: A MapReduce job in Hadoop is the Java code used to fetch
the data according to the preconditions provided. Verification of the MapReduce job is performed to
monitor the accuracy of the data fetched.
3. Data Extraction and Loading: This phase includes the validation of the data being loaded into and
extracted from HDFS (Hadoop Distributed File System) to ensure that no corrupt data is present in
HDFS.
4. Report Validation: This is the last phase of testing, ensuring that the output we are delivering
meets the accuracy standards and that there is no redundant data present in the reports.
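Production MapReduce jobs on Hadoop are written in Java, as phase 2 above notes; to illustrate just the map/shuffle/reduce flow that such a job is verified against, here is a classic word count sketched in plain Python (all names are ours, not Hadoop APIs):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts.
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

lines = ["big data is big", "data is everywhere"]
intermediate = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(intermediate))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Verifying the job then means checking that these counts match what an independent computation over the same input produces.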
20. Become a Big Data expert with Hortonworks certification only at SpringPeople!
Get updates on upcoming classes and webinars at www.springpeople.com
Editor's Notes
Do you guys know what a Tsunami is? Well, that is how BIG Data hit the technology world over the last 8 years. In the last 2 years alone, we've generated 90% of the world's available data, and this is just the beginning.
It is not a technology. It is not a tool. BIG Data is extremely large volumes of data, mostly unstructured in nature, which cannot be stored, processed or managed by traditional RDBMS tools. Till date, we're very familiar with gigabytes & megabytes. However, BIG Data runs in terabytes, petabytes and beyond.
Just to give you a comparison, here is a table showcasing the difference between each data metric. Today, we have data being generated running even into zettabytes. Just to give you a little context, a zettabyte is a one followed by 21 zeroes.
Now comes a very interesting question; what led to this data-explosion over the last 10 years?
There are 3 triggers for this.
1. Access to data became easier after the advent of the internet; never before did the human race have access to such in-depth information.
2. BIG Data provided extremely lucrative insights to business organizations.
3. Due to this cycle of access, insight & results, investments were made to ensure as much data as possible is collected.
This led to the explosion of BIG Data.
Let's see a few examples of this in actual business.
Till now, we’ve established that BIG Data is large, complex and diverse. Those qualities are broken down into the following 4 characteristics.
Volume
Velocity
Variety
Veracity
Let's break down each and every one of them.
As the name suggests, this characteristic refers to the sheer volume of BIG Data generated in the world today from a multitude of sources. For example, an airplane collects 10 terabytes of sensor data for every 30 minutes of flying time.
Today, we're hitting data generation on the scale of zettabytes. This is absolutely unprecedented.
Next is Velocity.
This refers to the speed at which data is generated, managed and processed. BIG Data is characterized by constantly increasing velocity. This is because an increase in data generation has led to the creation of advanced distributed processing networks and even real-time analytics tools.
Due to this, the velocity of BIG Data activities is constantly increasing.
Next.
BIG Data can be structured, semi-structured or unstructured, and comes from infinite sources. In addition, it can be from mediums such as geospatial data, 3D, audio, video, log files, network algorithms & social media.
Due to this, variety is a very crucial characteristic of BIG Data.
This refers to the biases, noise and abnormality in Big Data. Due to this, extreme caution has to be applied in the data-collection process. Companies spend billions of dollars either as penalties for false data collection or on data cleansing.
These are the 4 characteristics of BIG Data. Now let's move on to the types of BIG Data.