Leading Big Data Companies (2021)
+ Apache Big Data Stack
By Robert Marcus
Co-Chair of NIST Big Data Public Working Group
Outline of Presentation
Big Data Products
Apache Hadoop Stack
Related Apache Software
NIST Big Data Reference Architecture
Big Data Products
Inspired by an article in the Big Data Quarterly

https://www.dbta.com/BigDataQuarterly/Articles/Big-Data-50-Companies-Driving-Innovation-in-2021-148749.aspx
The presentation is purely informative. No endorsement
or validation of company information is implied.
Accenture
Actian
Aerospike
MWC LOS ANGELES 2021—October 26, 2021—Aerospike Inc., the leader in real-time
data platforms, today announced a partnership with Ably, the edge messaging platform
that powers synchronized digital experiences in real time. The two companies plan to
integrate and jointly market their solutions.

Ably is now a member of the recently expanded Aerospike Accelerate Partner Program.
Using Ably’s suite of APIs, organizations build, extend, and deliver powerful event-driven
applications for millions of concurrently connected devices. The Aerospike Real-time
Data Platform manages data from systems of record all the way out to the edge,
enabling organizations to act in real time across billions of transactions at petabyte
scale.

Together, the companies enable organizations to more quickly bring to market modern
IoT and other edge solutions that require data-intensive, real-time, and high-fidelity
workloads running from the edge to the core. Working with Ably and Aerospike,
enterprises, media companies, and telecommunications carriers solve problems of
intermittent device connectivity, synchronization, and processing of data from millions of
devices. The combined solution simplifies the development and deployment of digital
experiences at global scale — without the need for extensive custom development or a
massive data server infrastructure.
Alluxio
Cloud caching solution >
Zero-copy burst solution >
Faster workloads on object store solution >
Amazon Web Services
Cambridge Semantics
Cloudera
Cloudian
Cockroach Labs
Collibra Data Intelligence Cloud
Couchbase Capella
Databricks Lakehouse Platform
DataKitchen
DataStax Astra DB
Denodo Platform
Dremio Platform
Franz AllegroGraph
Knowledge Graphs
Gigaspaces InsightEdge
Google Cloud Big Query
Key features
ML and predictive modeling with BigQuery ML
BigQuery ML enables data scientists and data analysts to build and operationalize ML models on planet-scale
structured or semi-structured data, directly inside BigQuery, using simple SQL—in a fraction of the time. Export
BigQuery ML models for online prediction into Vertex AI or your own serving layer.
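As an illustration of the SQL-first workflow described above, here is a minimal sketch of training and querying a BigQuery ML model from Python with the google-cloud-bigquery client. The default project credentials, dataset, table, and column names (mydataset, customers, churned, etc.) are hypothetical.

```python
# A minimal sketch of BigQuery ML from Python. Assumes the
# google-cloud-bigquery client library and default credentials;
# dataset, table, and column names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()  # uses the default GCP project/credentials

# Train a logistic regression model with standard SQL (BigQuery ML).
train_sql = """
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT age, plan_type, monthly_spend, churned
FROM `mydataset.customers`
"""
client.query(train_sql).result()  # blocks until the training job finishes

# Score new rows with ML.PREDICT.
predict_sql = """
SELECT customer_id, predicted_churned
FROM ML.PREDICT(MODEL `mydataset.churn_model`,
                (SELECT customer_id, age, plan_type, monthly_spend
                 FROM `mydataset.new_customers`))
"""
for row in client.query(predict_sql).result():
    print(row.customer_id, row.predicted_churned)
```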

Multicloud data analysis with BigQuery Omni
BigQuery Omni is a flexible, fully managed, multicloud analytics solution that allows you to cost-effectively and
securely analyze data across clouds such as AWS and Azure. Use standard SQL and BigQuery’s familiar interface
to quickly answer questions and share results from a single pane of glass across your datasets.

Interactive data analysis with BigQuery BI Engine
BigQuery BI Engine is an in-memory analysis service built into BigQuery that enables users to analyze large and
complex datasets interactively with sub-second query response time and high concurrency. BI Engine natively
integrates with Google’s Data Studio and, now in preview, with Looker, Connected Sheets, and all our BI partner
solutions via ODBC/JDBC.

Geospatial analysis with BigQuery GIS
BigQuery GIS uniquely combines the serverless architecture of BigQuery with native support for geospatial
analysis, so you can augment your analytics workflows with location intelligence. Simplify your analyses, see
spatial data in fresh ways, and unlock entirely new lines of business with support for arbitrary points, lines,
polygons, and multi-polygons in common geospatial data formats.

GridGain Nebula
Cloud-Native Service for Apache Ignite
HPE (Hewlett Packard Enterprise) Green Lake
HVR
IBM Big Data Analytics
Data Lake for AI eBook
Big Data Analytics Tools
Explore Data Lakes
Explore IBM Db2 Database
Explore Data Warehouses
Explore Open Source Databases
Immuta Access Control
InfluxData InfluxDB
Informatica Big Data Management
Informatica Big Data Management enables your organization to process large,
diverse, and fast-changing data sets so you can get insights into your data. Use
Big Data Management to perform big data integration and transformation without
writing or maintaining external code. 

Use Big Data Management to collect diverse data faster, build business logic in a
visual environment, and eliminate hand-coding to get insights on your data.
Consider implementing a big data project in the following situations:

• The volume of the data that you want to process is greater than 10 terabytes. 

• You need to analyze or capture data changes in microseconds. 

• The data sources are varied and range from unstructured text to social media
data. 

You can perform run-time processing in the native environment or in a non-native
environment. The native environment is the Informatica domain where the Data
Integration Service performs all run-time processing. Use the native run-time
environment to process data that is less than 10 terabytes. A non-native
environment is a distributed cluster outside of the Informatica domain, such as
Hadoop or Databricks, where the Data Integration Service can push run-time
processing. Use a non-native run-time environment to optimize mapping
performance and process data that is greater than 10 terabytes.
IRI Liquid Data
IRI’s data cloud, visualization, applications and private cloud solutions manage all of
your data assets for faster insights and action. The IRI Liquid Data platform is the
industry’s most advanced, most utilized and most imitated end-to-end consumer
planning-to-activation solution. It comes with hundreds of integrated data sets for use in
our public cloud solution and can be further enriched with client data in a tailored private
cloud environment. It connects data, uncovers relevant patterns and applies the
smartest prescriptive analytics to determine the specific action steps you should take for
growth. 

Liquid Data Connected Enterprise
IRI Liquid Data Connected Enterprise is a self-service cloud solution that enables non-
technical business users to create complex data integrations that run on demand or
automatically on recurring schedules, from every minute to every month. All connected
data sets can instantly be utilized in the platform’s analytic models, business process
applications, visualization or alerting capabilities.

“IRI Liquid Data Connected Enterprise leverages a cutting-edge, federated architecture and
IRI’s high-performance, in-memory database to combat the fragmentation of data in
enterprises,” said Ash Patel, chief information officer for IRI. “The new connected
capabilities enable organizations to combine IRI, partner, third-party and their own first-
party data sets into a single fully integrated analytical and business application platform.”
MariaDB Enterprise
Matillion ETL
Melissa Data Quality Suite
Microsoft Big Data
MongoDB Big Data Architecture
NVIDIA Data Analytics
Ontotext Products
Oracle Big Data
Pure Storage Analytics
Quest Software Foglight
Redis FAQ
Reltio Connected Data Platform
SAP Big Data Reference Architecture
SAS Institute View of Key Technologies
Semarchy Intelligent Data Hub
Semarchy xDM
SnapLogic Data Science
Software AG Terracotta
REAL-TIME BIG DATA | SOFTWARE AG
Real-time big data offers incredible benefits to the enterprise, promising to help accelerate
decision-making, uncover new opportunities and provide unprecedented breadth of insight.
But working with real-time big data can strain traditional IT resources. When real-time big data
is stored in databases, latency can become a significant issue as the number of users and the data volume grow.

That’s where Terracotta In-Memory Data Management from Software AG can help. By
storing real-time big data in-memory, Terracotta provides ultra-fast access to massive data
sets to multiple users on multiple applications.
ULTRA-FAST ACCESS TO REAL-TIME BIG DATA
Software AG’s Terracotta makes massive data sets instantly available in ultra-fast RAM distributed across any size
server array. This real-time big data solution can easily maintain hundreds of terabytes of heterogeneous data in-
memory, with latency guaranteed in the low milliseconds. By accelerating access to real-time big data, Terracotta
accelerates application performance as well as time to insight and allows users to gather, sort and analyze data faster
than the competition. Enterprises can understand customer trends as they are happening, mitigate fast-breaking risk
and enjoy real-time data flows of any type of data to and from any device.

Terracotta enables enterprises to:

• Improve decision-making with faster access to information

• Discover hidden insights with ultra-fast access and messaging capabilities

• Take advantage of opportunities more quickly to protect and generate new revenue

• Connect to social, Web, mobile and other sources
Snowflake
SQream Reference Architecture
Swim Products
Syniti Data Migration
Tamr MDM
11 Big Data Blunders
Teradata Vantage Analytics
TigerGraph Products
VMware vSphere
Yellowbrick Cloud Data Warehouse
Apache Hadoop Stack
Apache Hadoop Stack
Spark
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides
an interface for programming entire clusters with implicit data parallelism and fault tolerance.
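To make the implicit data parallelism point concrete, here is a minimal PySpark sketch of a distributed word count; the same code runs unchanged on a laptop or a cluster. The input path and column names are hypothetical.

```python
# A minimal PySpark sketch: read a text file, split into words, and count
# occurrences. Spark distributes the work across executors automatically.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("word-count").getOrCreate()

# The path is hypothetical; it could be HDFS, S3, or local disk.
lines = spark.read.text("hdfs:///data/logs.txt")

counts = (lines
          .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
          .groupBy("word")
          .count()
          .orderBy(F.col("count").desc()))
counts.show(10)

spark.stop()
```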
Hive
Skeptical Criticism of Hive
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems
such as Amazon S3 filesystem and Alluxio. It provides a SQL-like query language called HiveQL[8] with
schema on read and transparently converts queries to MapReduce, Apache Tez[9] and Spark jobs. All three
execution engines can run in Hadoop's resource negotiator, YARN (Yet Another Resource Negotiator). To
accelerate queries, Hive provided indexes, but this feature was removed in version 3.0.[10] Other features of
Hive include: 

• Different storage types such as plain text, RCFile, HBase, ORC, and others.

• Metadata storage in a relational database management system, significantly reducing the time to
perform semantic checks during query execution.

• Operating on compressed data stored into the Hadoop ecosystem using algorithms including DEFLATE,
BWT, snappy, etc.

• Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive
supports extending the UDF set to handle use-cases not supported by built-in functions.

• SQL-like queries (HiveQL), which are implicitly converted into MapReduce or Tez, or Spark jobs.
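As a rough illustration of schema-on-read and HiveQL, the sketch below runs Hive-style SQL from Python through Spark's Hive integration rather than a direct Hive client. It assumes a reachable Hive metastore; the table, columns, and HDFS path are hypothetical.

```python
# A minimal sketch of HiveQL-style SQL via PySpark's Hive support.
# Assumes Spark is configured with a Hive metastore; names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-example")
         .enableHiveSupport()   # use the Hive metastore and HiveQL semantics
         .getOrCreate())

# Declare an external table over files that already exist (schema on read).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (ip STRING, url STRING, ts TIMESTAMP)
    STORED AS ORC
    LOCATION 'hdfs:///warehouse/web_logs'
""")

# A HiveQL-style query; the engine compiles it into distributed jobs.
spark.sql("""
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""").show()
```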
HCatalog
HCatalog is a table and storage management layer for Hadoop that enables users with different data processing
tools — Pig, MapReduce — to more easily read and write data on the grid. HCatalog’s table abstraction presents
users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not
worry about where or in what format their data is stored — RCFile format, text files, SequenceFiles, or ORC files.

HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be
written. By default, HCatalog supports RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a
custom format, you must provide the InputFormat, OutputFormat, and SerDe.

HCatalog is built on top of the Hive metastore and incorporates Hive's DDL. HCatalog provides read and write interfaces for
Pig and MapReduce and uses Hive's command line interface for issuing data definition and metadata exploration commands. 

HCatalog graduated from the Apache incubator and merged with the Hive project on March 26, 2013.
Map-Reduce
MapReduce is a framework for processing parallelizable problems across large datasets using a large number of
computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar
hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use
more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a
database (structured). MapReduce can take advantage of the locality of data, processing it near the place it is stored
in order to minimize communication overhead. 

A MapReduce framework (or system) is usually composed of three operations (or steps): 

1. Map: each worker node applies the map function to the local data, and writes the output to a temporary storage.
A master node ensures that only one copy of the redundant input data is processed.

2. Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all
data belonging to one key is located on the same worker node.

3. Reduce: worker nodes now process each group of output data, per key, in parallel.

MapReduce allows for the distributed processing of the map and reduction operations. Maps can be performed in
parallel, provided that each mapping operation is independent of the others; in practice, this is limited by the number
of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform
the reduction phase, provided that all outputs of the map operation that share the same key are presented to the
same reducer at the same time, or that the reduction function is associative. While this process often appears
inefficient compared to algorithms that are more sequential (because multiple instances of the reduction process
must be run), MapReduce can be applied to significantly larger datasets than a single "commodity" server can
handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours.[16] The parallelism
also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper
or reducer fails, the work can be rescheduled – assuming the input data are still available.
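The three phases above can be illustrated with a toy, single-process Python word count; real frameworks execute the same map, shuffle, and reduce steps across many worker nodes.

```python
# A toy, single-process sketch of the map / shuffle / reduce steps described
# above, using a word count. This only illustrates the data flow; a real
# framework distributes each phase across a cluster.
from collections import defaultdict
from itertools import chain

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: emit (key, value) pairs from each input record independently.
def map_fn(doc):
    return [(word, 1) for word in doc.split()]

mapped = list(chain.from_iterable(map_fn(d) for d in documents))

# Shuffle: group all values for the same key together
# (in a cluster this is the redistribution of data between worker nodes).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: combine each key's values; different keys can be reduced in parallel.
def reduce_fn(key, values):
    return key, sum(values)

counts = dict(reduce_fn(k, v) for k, v in groups.items())
print(counts)   # e.g. {'the': 3, 'quick': 2, ...}
```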
Impala
Hive vs Impala
Solr
Kite
Figures: “Without Kite / With Kite” comparison, example, and architecture.
Kite is a high-level data layer for Hadoop. It is an API and a set of tools that
speed up development. You configure how Kite stores your data in Hadoop,
instead of building and maintaining that infrastructure yourself.
YARN
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/
monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application
ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager
is the ultimate authority that arbitrates resources among all the applications in the system. The
NodeManager is the per-machine framework agent who is responsible for containers, monitoring their
resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with
negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and
monitor the tasks.
Sentry
Apache Sentry is a granular, role-based authorization module for Hadoop. Sentry provides the ability to
control and enforce precise levels of privileges on data for authenticated users and applications on a
Hadoop cluster. Sentry currently works out of the box with Apache Hive, Hive Metastore/HCatalog,
Apache Solr, Impala and HDFS (limited to Hive table data). Sentry is designed to be a pluggable
authorization engine for Hadoop components. It allows you to define authorization rules to validate a
user or application’s access requests for Hadoop resources. Sentry is highly modular and can support
authorization for a wide variety of data models in Hadoop.
RecordService
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many
similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to
application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable
streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine
project. HDFS is now an Apache Hadoop subproject. The project URL is https://hadoop.apache.org/hdfs/.
Kudu
Table - A table is where your data is stored in Kudu. A table has a schema and a totally ordered primary key. A table is split into segments called tablets.

Tablet - A tablet is a contiguous segment of a table, similar to a partition in other data storage engines or relational databases. A given tablet is replicated on
multiple tablet servers, and at any given point in time, one of these replicas is considered the leader tablet. Any replica can service reads, and writes require
consensus among the set of tablet servers serving the tablet.

Tablet Server - A tablet server stores and serves tablets to clients. For a given tablet, one tablet server acts as a leader, and the others act as follower
replicas of that tablet. Only leaders service write requests, while leaders or followers each service read requests. Leaders are elected using Raft Consensus
Algorithm. One tablet server can serve multiple tablets, and one tablet can be served by multiple tablet servers.

Master - The master keeps track of all the tablets, tablet servers, the Catalog Table, and other metadata related to the cluster. At a given point in time, there
can only be one acting master (the leader). If the current leader disappears, a new master is elected using Raft Consensus Algorithm. The master also
coordinates metadata operations for clients.
Kudu is a columnar storage manager developed for the Apache Hadoop platform. Kudu shares the common technical properties of
Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operation.
HBase
HBase is an open-source, distributed key-value data storage system and column-oriented database with
high write throughput and low-latency random read performance. By using HBase, we can perform online
real-time analytics. The HBase architecture provides strong random read performance. In HBase, data is sharded
physically into what are known as regions. Each region is hosted by a single region server, and each region
server is responsible for one or more regions. The HBase architecture follows a master-slave design: an
HBase cluster has one master node, called HMaster, and several region servers (HRegionServers). Each
region server hosts multiple regions.
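A minimal sketch of low-latency writes, random reads, and range scans against HBase from Python, using the community happybase client (which talks to the HBase Thrift gateway). The host, table, column family, and row keys are hypothetical.

```python
# A minimal sketch of HBase access via the happybase client.
# Assumes an HBase Thrift server is running; all names are hypothetical.
import happybase

connection = happybase.Connection(host="hbase-thrift.example.com")
table = connection.table("user_events")

# Write: rows are keyed by a byte-string row key; columns live in families.
table.put(b"user42#2021-10-26", {
    b"events:page": b"/pricing",
    b"events:duration_ms": b"1870",
})

# Low-latency random read by row key.
row = table.row(b"user42#2021-10-26")
print(row[b"events:page"])

# Range scan: rows within a region are sorted by row key.
for key, data in table.scan(row_prefix=b"user42#"):
    print(key, data)

connection.close()
```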
Sqoop
Sqoop is a tool that imports data from relational databases to HDFS and also exports
data from HDFS to relational databases. Sqoop can transfer bulk data
efficiently between Hadoop and external data stores such as enterprise data
warehouses and relational databases, and it imports data from external
datastores into Hadoop ecosystem tools like Hive and HBase.
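Sqoop is driven from its command-line interface; the sketch below simply wraps two typical invocations (an import into HDFS and an export back out) from Python. The JDBC URL, credentials, table names, and HDFS paths are hypothetical, and the sqoop binary is assumed to be installed on the client machine.

```python
# A minimal sketch of invoking the Sqoop CLI from Python.
# The connection string, table names, and paths are hypothetical.
import subprocess

# Import a relational table into HDFS, splitting the work across 4 mappers.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",
    "--username", "etl",
    "--table", "orders",
    "--target-dir", "/data/orders",
    "--num-mappers", "4",
], check=True)

# Export processed results from HDFS back to a relational table.
subprocess.run([
    "sqoop", "export",
    "--connect", "jdbc:mysql://db.example.com/sales",
    "--username", "etl",
    "--table", "order_summaries",
    "--export-dir", "/data/order_summaries",
], check=True)
```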
Flume
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and flexible
architecture based on streaming data flows. It is robust and fault tolerant with
tunable reliability mechanisms and many failover and recovery mechanisms. It uses
a simple extensible data model that allows for online analytic applications.
Kafka
Apache Kafka® is a distributed streaming platform that:

• Publishes and subscribes to streams of records, similar to a message queue or enterprise messaging
system.

• Stores streams of records in a fault-tolerant durable way.

• Processes streams of records as they occur.

Kafka is used for these broad classes of applications:

• Building real-time streaming data pipelines that reliably get data between systems or applications.

• Building real-time streaming applications that transform or react to the streams of data.

Kafka is run as a cluster on one or more servers that can span multiple datacenters. The Kafka cluster stores
streams of records in categories called topics. Each record consists of a key, a value, and a timestamp.
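A minimal sketch of publishing and consuming records with the community kafka-python client, matching the key/value/timestamp record model described above. The broker addresses, topic name, and consumer group are hypothetical.

```python
# A minimal sketch of a Kafka producer and consumer using kafka-python.
# Broker addresses and the topic name are hypothetical.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: each record has a key, a value, and a timestamp.
producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page-views", key=b"user42", value={"url": "/pricing"})
producer.flush()

# Consumer: subscribes to the topic and processes records as they arrive.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers=["broker1:9092", "broker2:9092"],
    group_id="analytics",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:   # loops until interrupted
    print(record.key, record.value, record.timestamp)
```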
Kafka vs Flume
Related Apache Software
Ambari
The Apache Ambari project is aimed at making Hadoop management simpler by developing
software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari
provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

Ambari enables System Administrators to:

• Provision a Hadoop Cluster

◦ Ambari provides a step-by-step wizard for installing Hadoop services across any
number of hosts.

◦ Ambari handles configuration of Hadoop services for the cluster.

• Manage a Hadoop Cluster

◦ Ambari provides central management for starting, stopping, and reconfiguring Hadoop
services across the entire cluster.

• Monitor a Hadoop Cluster

◦ Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.

◦ Ambari leverages Ambari Metrics System for metrics collection.

◦ Ambari leverages Ambari Alert Framework for system alerting and will notify you when
your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).

Ambari enables Application Developers
and System Integrators to:

• Easily integrate Hadoop provisioning, management, and monitoring capabilities into their
own applications with the Ambari REST APIs.
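A minimal sketch of driving Ambari's REST APIs from Python with the requests library: listing the clusters a server manages and asking Ambari to stop a service. The hostname, credentials, and cluster name are hypothetical.

```python
# A minimal sketch of the Ambari REST API from Python.
# Hostname, credentials, and cluster/service names are hypothetical.
import requests

AMBARI = "http://ambari.example.com:8080/api/v1"
AUTH = ("admin", "admin")
HEADERS = {"X-Requested-By": "ambari"}   # required by Ambari for write requests

# List clusters managed by this Ambari server.
clusters = requests.get(f"{AMBARI}/clusters", auth=AUTH).json()
print([c["Clusters"]["cluster_name"] for c in clusters["items"]])

# Ask Ambari to stop the HDFS service on a cluster (state change request).
requests.put(
    f"{AMBARI}/clusters/mycluster/services/HDFS",
    auth=AUTH,
    headers=HEADERS,
    json={"RequestInfo": {"context": "Stop HDFS via REST"},
          "Body": {"ServiceInfo": {"state": "INSTALLED"}}},
)
```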
Avro
Apache Avro™ is a data serialization system.

Avro provides:

• Rich data structures.

• A compact, fast, binary data format.

• A container file, to store persistent data.

• Remote procedure call (RPC).

• Simple integration with dynamic languages. Code generation is not required to read or write
data files nor to use or implement RPC protocols. Code generation is an optional
optimization, only worth implementing for statically typed languages.
Avro provides functionality similar to systems such as Thrift, Protocol Buffers,
etc. Avro differs from these systems in the following fundamental aspects.

• Dynamic typing: Avro does not require that code be generated. Data is
always accompanied by a schema that permits full processing of that
data without code generation, static datatypes, etc. This facilitates
construction of generic data-processing systems and languages.

• Untagged data: Since the schema is present when data is read,
considerably less type information need be encoded with data, resulting
in smaller serialization size.

• No manually-assigned field IDs: When a schema changes, both the old
and new schema are always present when processing data, so
differences may be resolved symbolically, using field names.
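A minimal sketch of Avro's schema-carrying container files from Python, here using the community fastavro library; because the schema is embedded in the file, no code generation is needed to read it back. The file name and record fields are hypothetical.

```python
# A minimal sketch of writing and reading an Avro container file with fastavro.
# The schema travels with the data, so readers need no generated code.
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

records = [
    {"name": "Ada", "age": 36, "email": "ada@example.com"},
    {"name": "Grace", "age": 45, "email": None},
]

# Write a compact binary container file; the schema is stored in the header.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Read the file back; the embedded schema drives deserialization.
with open("users.avro", "rb") as fh:
    for record in reader(fh):
        print(record["name"], record["age"])
```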
Cassandra
Cassandra is a NoSQL distributed database. By design, NoSQL databases are lightweight, open-
source, non-relational, and largely distributed. Counted among their strengths are horizontal
scalability, distributed architectures, and a flexible approach to schema definition.

NoSQL databases enable rapid, ad-hoc organization and analysis of extremely high-volume, disparate
data types. That’s become more important in recent years, with the advent of Big Data and the need
to rapidly scale databases in the cloud. Cassandra is among the NoSQL databases that have
addressed the constraints of previous data management technologies, such as SQL databases.
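A minimal sketch of Cassandra's flexible-schema, distributed model from Python using the DataStax cassandra-driver: create a keyspace with a replication factor, define a table, and read and write rows. The contact points, keyspace, and table names are hypothetical.

```python
# A minimal sketch of Cassandra access via the DataStax Python driver.
# Contact points, keyspace, and table names are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1", "10.0.0.2"])   # any nodes can serve as contact points
session = cluster.connect()

# Replication is chosen per keyspace; the schema is defined per table.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS analytics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace("analytics")
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        user_id text, event_time timestamp, url text,
        PRIMARY KEY (user_id, event_time)
    )
""")

# Writes and reads are routed to the replicas that own the partition key.
session.execute(
    "INSERT INTO events (user_id, event_time, url) VALUES (%s, toTimestamp(now()), %s)",
    ("user42", "/pricing"),
)
for row in session.execute("SELECT * FROM events WHERE user_id = %s", ("user42",)):
    print(row.user_id, row.event_time, row.url)

cluster.shutdown()
```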
Chukwa
Apache Chukwa aims to provide a flexible and powerful platform for distributed data collection and rapid data processing. Our goal is
to produce a system that's usable today, but that can be modified to take advantage of newer storage technologies (HDFS appends,
HBase, etc) as they mature. In order to maintain this flexibility, Apache Chukwa is structured as a pipeline of collection and processing
stages, with clean and narrow interfaces between stages. This will facilitate future innovation without breaking existing code.

Apache Chukwa has five primary components:
• Adaptors that collect data from various data sources.

• Agents that run on each machine and emit data.

• ETL Processes for parsing and archiving the data.

• Data Analytics Scripts for aggregating Hadoop cluster health.

• HICC, the Hadoop Infrastructure Care Center, a web-portal-style interface for displaying data.
Druid Architecture
Mahout for Machine Learning
Mahout Ecosystem
Mahout Algorithms
Machine Learning Graphic
Oozie
Hadoop is designed to handle large amounts of data from many sources, and to carry out often complicated work
of various types against that data across the cluster. That’s a lot of work, and the best way to get things done is to
be organised with a schedule. That’s what Apache Oozie does. It schedules the work (jobs) in Hadoop.

Oozie enables users to combine multiple different Hadoop tasks, such as map/reduce tasks, Pig jobs, and Sqoop jobs
for moving SQL data into Hadoop, into a logical unit of work. This is managed via an Oozie Workflow, which is a
Directed Acyclic Graph (DAG) of the tasks to be carried out. The DAG is stored in an XML Process
Definition Language called hPDL.

An Oozie Server is deployed as a Java Web Application hosted in a Tomcat server, and all of the stateful
information such as workflow definitions, jobs, etc., is stored in a database. This database can be either Apache
Derby, HSQL, Oracle, MySQL, or PostgreSQL. There is an Oozie Client, which submits work
either via a CLI, an API, or a web service / REST.

Ozone
Ozone is a scalable, redundant, and distributed object store for Hadoop.
Apart from scaling to billions of objects of varying sizes, Ozone can function
effectively in containerized environments such as Kubernetes and YARN.
Applications using frameworks like Apache Spark, YARN and Hive work
natively without any modifications. Ozone is built on a highly available,
replicated block storage layer called Hadoop Distributed Data Store (HDDS).
From https://blog.cloudera.com/introducing-apache-hadoop-ozone-object-store-apache-hadoop/ 

True to its big data roots, HDFS works best when most of the files are large – tens to hundreds of MBs.
HDFS suffers from the famous small files limitation and struggles with over 400 million files. There is an
increased demand for an HDFS-like storage system that can scale to billions of small files. Ozone is a
distributed key-value store that can manage both small and large files alike. While HDFS provides
POSIX-like semantics, Ozone looks and behaves like an Object Store.
Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their
structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs,
for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently
consists of a textual language called Pig Latin, which has the following key properties:

• Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis
tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow
sequences, making them easy to write, understand, and maintain.

• Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics rather than efficiency.

• Extensibility. Users can create their own functions to do special-purpose processing.
From https://data-flair.training/blogs/hadoop-pig-tutorial/
Submarine
Deep learning is useful for enterprise tasks in fields such as speech recognition, image classification, AI chatbots,
and machine translation, to name a few. To train deep learning/machine learning models, frameworks
such as TensorFlow, MXNet, PyTorch, Caffe, and XGBoost can be leveraged, and sometimes these frameworks
are used together to solve different problems. To make distributed deep learning/machine learning applications
easy to launch, manage, and monitor, the Hadoop community initiated the Submarine project along with other
improvements such as first-class GPU support, Docker container support, container-DNS support, scheduling
improvements, etc. These improvements make running distributed deep learning/machine learning applications on
Apache Hadoop YARN as simple as running them locally, which lets machine-learning engineers focus on
algorithms instead of worrying about the underlying infrastructure. By upgrading to the latest Hadoop, users can now run
deep learning workloads alongside other ETL/streaming jobs on the same cluster. This provides easy
access to data on the same cluster and better resource utilization.
Zeppelin
Tez
The Apache TEZ® project is aimed at building an application framework which allows for a complex directed-
acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.
The two main design themes for Tez are:

• Empowering end users by:
◦ Expressive dataflow definition APIs

◦ Flexible Input-Processor-Output runtime model

◦ Data type agnostic

◦ Simplifying deployment

• Execution Performance
◦ Performance gains over Map Reduce

◦ Optimal resource management

◦ Plan reconfiguration at runtime

◦ Dynamic physical data flow decisions
By allowing projects like Apache Hive and Apache Pig to run a
complex DAG of tasks, Tez can process data that previously
required multiple MapReduce jobs in a single Tez job.
ZooKeeper
Apache ZooKeeper is a distributed coordination service for managing a large set of hosts. Coordinating
and managing services in a distributed environment is a complicated process. Apache ZooKeeper,
with its simple architecture and API, solves this problem. ZooKeeper allows developers to focus on core
application logic without worrying about the distributed nature of the application. The ZooKeeper framework
provides a complete mechanism for overcoming the challenges faced by distributed applications. Apache
ZooKeeper handles race conditions and deadlocks using a fail-safe synchronization approach, and it
handles data inconsistency with atomicity.
The various services provided by Apache ZooKeeper are as follows −

• Naming service − This service identifies the nodes in the cluster by name. It is similar to DNS, but
for nodes.

• Configuration management − This service provides the latest and up-to-date configuration information of a system
for the joining node.

• Cluster management − This service tracks the joining and leaving of nodes in the cluster and reports node
status in real time.

• Leader election − This service elects a node as a leader for the coordination purpose.

• Locking and synchronization service − This service locks data while it is being modified. It helps with automatic
failure recovery when connecting to other distributed applications such as Apache HBase.

• Highly reliable data registry − It offers data availability even when one or a few nodes go down.
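A minimal sketch of the services listed above (configuration, naming/membership, and locking) from Python using the community kazoo client for ZooKeeper. The connection string and znode paths are hypothetical.

```python
# A minimal sketch of ZooKeeper coordination primitives via kazoo.
# The connection string and znode paths are hypothetical.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Configuration management: store and watch a small piece of shared config.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_x=on")

@zk.DataWatch("/app/config")
def on_config_change(data, stat):
    print("config is now:", data)

# Naming / cluster membership: each worker registers an ephemeral node that
# disappears automatically if the worker's session dies.
zk.create("/app/workers/worker-", value=b"10.0.0.5",
          ephemeral=True, sequence=True, makepath=True)
print(zk.get_children("/app/workers"))

# Locking: a distributed lock around a critical section.
lock = zk.Lock("/app/locks/rebuild-index", identifier="worker-1")
with lock:
    pass  # only one client at a time runs this block

zk.stop()
```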
NIST Big Data Reference Architecture
NIST Big Data Architecture

Mais conteúdo relacionado

Mais procurados

Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
 
Partner webinar presentation aws pebble_treasure_data
Partner webinar presentation aws pebble_treasure_dataPartner webinar presentation aws pebble_treasure_data
Partner webinar presentation aws pebble_treasure_data
Treasure Data, Inc.
 

Mais procurados (20)

Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
What’s New in Amazon RDS for Open-Source and Commercial Databases:
What’s New in Amazon RDS for Open-Source and Commercial Databases: What’s New in Amazon RDS for Open-Source and Commercial Databases:
What’s New in Amazon RDS for Open-Source and Commercial Databases:
 
Partner webinar presentation aws pebble_treasure_data
Partner webinar presentation aws pebble_treasure_dataPartner webinar presentation aws pebble_treasure_data
Partner webinar presentation aws pebble_treasure_data
 
How to Accelerate the Adoption of AWS and Reduce Cost and Risk with a Data F...
 How to Accelerate the Adoption of AWS and Reduce Cost and Risk with a Data F... How to Accelerate the Adoption of AWS and Reduce Cost and Risk with a Data F...
How to Accelerate the Adoption of AWS and Reduce Cost and Risk with a Data F...
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
Cloud comparison - AWS vs Azure vs Google
Cloud comparison - AWS vs Azure vs GoogleCloud comparison - AWS vs Azure vs Google
Cloud comparison - AWS vs Azure vs Google
 
Deep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksDeep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech Talks
 
Hybrid Cloud Storage: Why HUSCO International Left Traditional Storage Behind
 Hybrid Cloud Storage: Why HUSCO International Left Traditional Storage Behind Hybrid Cloud Storage: Why HUSCO International Left Traditional Storage Behind
Hybrid Cloud Storage: Why HUSCO International Left Traditional Storage Behind
 
Builders Day' - Databases on AWS: The Right Tool for The Right Job
Builders Day' - Databases on AWS: The Right Tool for The Right JobBuilders Day' - Databases on AWS: The Right Tool for The Right Job
Builders Day' - Databases on AWS: The Right Tool for The Right Job
 
大數據運算媒體業案例分享 (Big Data Compute Case Sharing for Media Industry)
大數據運算媒體業案例分享 (Big Data Compute Case Sharing for Media Industry)大數據運算媒體業案例分享 (Big Data Compute Case Sharing for Media Industry)
大數據運算媒體業案例分享 (Big Data Compute Case Sharing for Media Industry)
 
Replicate and Manage Data Using Managed Databases and Serverless Technologies
Replicate and Manage Data Using Managed Databases and Serverless Technologies Replicate and Manage Data Using Managed Databases and Serverless Technologies
Replicate and Manage Data Using Managed Databases and Serverless Technologies
 
Next-Generation Security Operations with AWS | AWS Public Sector Summit 2016
Next-Generation Security Operations with AWS | AWS Public Sector Summit 2016Next-Generation Security Operations with AWS | AWS Public Sector Summit 2016
Next-Generation Security Operations with AWS | AWS Public Sector Summit 2016
 
AWS re:Invent 2016: High Performance Cinematic Production in the Cloud (MAE304)
AWS re:Invent 2016: High Performance Cinematic Production in the Cloud (MAE304)AWS re:Invent 2016: High Performance Cinematic Production in the Cloud (MAE304)
AWS re:Invent 2016: High Performance Cinematic Production in the Cloud (MAE304)
 
Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018
Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018
Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018
 
Back Up and Manage On-Premises and Cloud-Native Workloads with Rubrik on AWS ...
Back Up and Manage On-Premises and Cloud-Native Workloads with Rubrik on AWS ...Back Up and Manage On-Premises and Cloud-Native Workloads with Rubrik on AWS ...
Back Up and Manage On-Premises and Cloud-Native Workloads with Rubrik on AWS ...
 
AWS re:Invent 2016: Learn how IFTTT uses ElastiCache for Redis to predict eve...
AWS re:Invent 2016: Learn how IFTTT uses ElastiCache for Redis to predict eve...AWS re:Invent 2016: Learn how IFTTT uses ElastiCache for Redis to predict eve...
AWS re:Invent 2016: Learn how IFTTT uses ElastiCache for Redis to predict eve...
 
Databases on AWS Workshop.pdf
Databases on AWS Workshop.pdfDatabases on AWS Workshop.pdf
Databases on AWS Workshop.pdf
 
AWSome Day - Solutions Architecture Best Practices
AWSome Day - Solutions Architecture Best PracticesAWSome Day - Solutions Architecture Best Practices
AWSome Day - Solutions Architecture Best Practices
 
A Well Architected SaaS - A Holistic Look at Cloud Architecture - Pop-up Loft...
A Well Architected SaaS - A Holistic Look at Cloud Architecture - Pop-up Loft...A Well Architected SaaS - A Holistic Look at Cloud Architecture - Pop-up Loft...
A Well Architected SaaS - A Holistic Look at Cloud Architecture - Pop-up Loft...
 
AWS User Group October
AWS User Group OctoberAWS User Group October
AWS User Group October
 

Semelhante a Big Data Companies and Apache Software

ds_Pivotal_Big_Data_Suite_Product_Suite
ds_Pivotal_Big_Data_Suite_Product_Suiteds_Pivotal_Big_Data_Suite_Product_Suite
ds_Pivotal_Big_Data_Suite_Product_Suite
Robin Fong 方俊强
 
zData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData BI & Advanced Analytics Platform + 8 Week Pilot ProgramszData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData Inc.
 

Semelhante a Big Data Companies and Apache Software (20)

ds_Pivotal_Big_Data_Suite_Product_Suite
ds_Pivotal_Big_Data_Suite_Product_Suiteds_Pivotal_Big_Data_Suite_Product_Suite
ds_Pivotal_Big_Data_Suite_Product_Suite
 
IBM Cloud pak for data brochure
IBM Cloud pak for data   brochureIBM Cloud pak for data   brochure
IBM Cloud pak for data brochure
 
Big data
Big dataBig data
Big data
 
Big Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential ToolsBig Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential Tools
 
An Overview of All The Different Databases in Google Cloud
An Overview of All The Different Databases in Google CloudAn Overview of All The Different Databases in Google Cloud
An Overview of All The Different Databases in Google Cloud
 
Comprehensive Guide for Microsoft Fabric to Master Data Analytics
Comprehensive Guide for Microsoft Fabric to Master Data AnalyticsComprehensive Guide for Microsoft Fabric to Master Data Analytics
Comprehensive Guide for Microsoft Fabric to Master Data Analytics
 
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
 
Data Virtualization: Introduction and Business Value (UK)
Data Virtualization: Introduction and Business Value (UK)Data Virtualization: Introduction and Business Value (UK)
Data Virtualization: Introduction and Business Value (UK)
 
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsEnabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
 
25 Best Data Mining Tools in 2022
25 Best Data Mining Tools in 202225 Best Data Mining Tools in 2022
25 Best Data Mining Tools in 2022
 
zData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData BI & Advanced Analytics Platform + 8 Week Pilot ProgramszData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData BI & Advanced Analytics Platform + 8 Week Pilot Programs
 
Ανδρέας Τσαγκάρης, 5th Digital Banking Forum
Ανδρέας Τσαγκάρης, 5th Digital Banking ForumΑνδρέας Τσαγκάρης, 5th Digital Banking Forum
Ανδρέας Τσαγκάρης, 5th Digital Banking Forum
 
Infochimps #1 Big Data Platform for the Cloud
Infochimps #1 Big Data Platform for the CloudInfochimps #1 Big Data Platform for the Cloud
Infochimps #1 Big Data Platform for the Cloud
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on Hadoop
 
Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
 
Crowdstar case-study
Crowdstar case-studyCrowdstar case-study
Crowdstar case-study
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
BlueData DataSheet
BlueData DataSheetBlueData DataSheet
BlueData DataSheet
 
Big Data on Azure Tutorial
Big Data on Azure TutorialBig Data on Azure Tutorial
Big Data on Azure Tutorial
 
Best Bigquery ETL Tool
Best Bigquery ETL ToolBest Bigquery ETL Tool
Best Bigquery ETL Tool
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Último (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Big Data Companies and Apache Software

  • 1. Leading Big Data Companies (2021) + Apache Big Data Stack By Robert Marcus Co-Chair of NIST Big Data Public Working Group
  • 2. Outline of Presentation Big Data Products Apache Hadoop Stack Related Apache Software NIST Big Data Reference Architecture
  • 3. Big Data Products Inspired by an article in the Big Data Quarterly https://www.dbta.com/BigDataQuarterly/Articles/Big-Data-50-Companies- Driving-Innovation-in-2021-148749.aspx . The presentation is purely informative. No endorsement or validation of company information is implied.
  • 7. Aerospike MWC LOS ANGELES 2021.—October 26, 2021—Aerospike Inc., the leader in real-time data platforms, today announced a partnership with Ably, the edge messaging platform that powers synchronized digital experiences in real time. The two companies plan to integrate and jointly market their solutions. Ably is now a member of the recently expanded Aerospike Accelerate Partner Program. Using Ably’s suite of APIs, organizations build, extend, and deliver powerful event-driven applications for millions of concurrently connected devices. The Aerospike Real-time Data Platform manages data from systems of record all the way out to the edge, enabling organizations to act in real time across billions of transactions at petabyte scale. Together, the companies enable organizations to more quickly bring to market modern IoT and other edge solutions that require data-intensive, real-time, and high-fidelity workloads running from the edge to the core. Working with Ably and Aerospike, enterprises, media companies, and telecommunications carriers solve problems of intermittent device connectivity, synchronization, and processing of data from millions of devices. The combined solution simplifies the development and deployment of digital experiences at global scale — without the need for extensive custom development or a massive data server infrastructure.
  • 8. Alluxio Cloud caching solution > Zero-copy burst solution > Faster workloads on object store solution >
  • 27. Google Cloud Big Query Key features ML and predictive modeling with BigQuery ML BigQuery ML enables data scientists and data analysts to build and operationalize ML models on planet-scale structured or semi-structured data, directly inside BigQuery, using simple SQL—in a fraction of the time. Export BigQuery ML models for online prediction into Vertex AI or your own serving layer. Learn more about the models we currently support. Multicloud data analysis with BigQuery Omni BigQuery Omni is a flexible, fully managed, multicloud analytics solution that allows you to cost-effectively and securely analyze data across clouds such as AWS and Azure. Use standard SQL and BigQuery’s familiar interface to quickly answer questions and share results from a single pane of glass across your datasets. Read more about our GA launch here. Interactive data analysis with BigQuery BI Engine BigQuery BI Engine is an in-memory analysis service built into BigQuery that enables users to analyze large and complex datasets interactively with sub-second query response time and high concurrency. BI Engine natively integrates with Google’s Data Studio, and now in preview, to Looker, Connected Sheets, and all our BI partners solutions via ODBC/JDBC. Learn more and enroll in BI Engine’s preview. Geospatial analysis with BigQuery GIS BigQuery GIS uniquely combines the serverless architecture of BigQuery with native support for geospatial analysis, so you can augment your analytics workflows with location intelligence. Simplify your analyses, see spatial data in fresh ways, and unlock entirely new lines of business with support for arbitrary points, lines, polygons, and multi-polygons in common geospatial data formats. View all features
  • 29. HPE (Hewlett Packard Enterprise) Green Lake
  • 30. HVR
  • 31. IBM Big Data Analytics Data Lake for AI eBook Big Data Analytics Tools Explore Data Lakes Explore IBM Db2 Database Explore Data Warehouses Explore Open Source Databases
  • 34. Informatica Big Data Management Informatica Big Data Management enables your organization to process large, diverse, and fast changing data sets so you can get insights into your data. Use Big Data Management to perform big data integration and transformation without writing or maintaining external code. Use Big Data Management to collect diverse data faster, build business logic in a visual environment, and eliminate hand-coding to get insights on your data. Consider implementing a big data project in the following situations: • The volume of the data that you want to process is greater than 10 terabytes. • You need to analyze or capture data changes in microseconds. • The data sources are varied and range from unstructured text to social media data. You can perform run-time processing in the native environment or in a non-native environment. The native environment is the Informatica domain where the Data Integration Service performs all run-time processing. Use the native run-time environment to process data that is less than 10 terabytes. A non-native environment is a distributed cluster outside of the Informatica domain, such as Hadoop or Databricks, where the Data Integration Service can push run-time processing. Use a non-native run-time environment to optimize mapping performance and process data that is greater than 10 terabytes.
  • 35. IRI Liquid Data IRI’s data cloud, visualization, applications and private cloud solutions manage all of your data assets for faster insights and action. The IRI Liquid Data platform is the industry’s most advanced, most utilized and most imitated end-to-end consumer planning to activation solution. It comes with hundreds of integrated data sets for use in our public cloud solution and can be further enriched with client data in a tailored private cloud environment. It connects data, uncovers relevant patterns and applies the smartest prescriptive analytics to determine the specific action steps you should take for growth. Liquid Data Connected Enterprise IRI Liquid Data Connected Enterprise is a self-service cloud solution that enables non- technical business users to create complex data integrations that run on demand or automatically on recurring schedules, from every minute to every month. All connected data sets can instantly be utilized in the platform’s analytic models, business process applications, visualization or alerting capabilities. “IRI Liquid Data Connected Enterprise leverages a cutting-edge, federated architecture and IRI’s high-performance, in-memory database to combat the fragmentation of data in enterprises,” said Ash Patel, chief information officer for IRI. “The new connected capabilities enable organizations to combine IRI, partner, third-party and their own first- party data sets into a single fully integrated analytical and business application platform.”
  • 40. MongoDB Big Data Architecture
  • 48. SAP Big Data Reference Architecture
  • 49. SAS InstituteView of Key Technologies
  • 50. Semarchy Intelligent Data Hub Semarchy xDM
  • 52. Software AG Terracotta REAL-TIME BIG DATA | SOFTWARE AG Real-time big data offers incredible benefits to the enterprise, promising to help accelerate decision-making, uncover new opportunities and provide unprecedented breadth of insight. But working with real-time big data can strain traditional IT resources. When real-time big data is stored in databases, latency can become a significant issue as the number of users rises to ever-larger volumes. That’s where Terracotta In-Memory Data Management from Software AG can help. By storing real-time big data in-memory, Terracotta provides ultra-fast access to massive data sets to multiple users on multiple applications. ULTRA-FAST ACCESS TO REAL-TIME BIG DATA Software AG’s Terracotta makes massive data sets instantly available in ultra-fast RAM distributed across any size server array. This real-time big data solution can easily maintain hundreds of terabytes of heterogeneous data in- memory, with latency guaranteed in the low milliseconds. By accelerating access to real-time big data, Terracotta accelerates application performance as well as time to insight and allows users to gather, sort and analyze data faster than the competition. Enterprises can understand customer trends as they are happening, mitigate fast-breaking risk and enjoy real-time data flows of any type of data to and from any device. Terracotta enables enterprises to: • Improve decision-making with faster access to information • Discover hidden insights and with ultra-fast access and messaging capabilities • Take advantage of opportunities more quickly to protect and generate new revenue • Connect to social, Web, mobile and other sources
  • 57. Tamr MDM 11 Big Data Blunders
  • 64. Spark Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
  • 65. Hive Skeptical Criticism of Hive Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as Amazon S3 filesystem and Alluxio. It provides a SQL-like query language called HiveQL[8] with schema on read and transparently converts queries to MapReduce, Apache Tez[9] and Spark jobs. All three execution engines can run in Hadoop's resource negotiator, YARN (Yet Another Resource Negotiator). To accelerate queries, it provided indexes, but this feature was removed in version 3.0 [10] Other features of Hive include: • Different storage types such as plain text, RCFile, HBase, ORC, and others. • Metadata storage in a relational database management system, significantly reducing the time to perform semantic checks during query execution. • Operating on compressed data stored into the Hadoop ecosystem using algorithms including DEFLATE, BWT, snappy, etc. • Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by built-in functions. • SQL-like queries (HiveQL), which are implicitly converted into MapReduce or Tez, or Spark jobs.
  • 66. HCatalog HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools — Pig, MapReduce — to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored — RCFile format, text files, SequenceFiles, or ORC files. HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe. HCatalog is built on top of the Hive metastore and incorporates Hive's DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive's command line interface for issuing data definition and metadata exploration commands. HCatalog graduated from the Apache incubator and merged with the Hive project on March 26, 2013.
  • 67. Map-Reduce MapReduce is a framework for processing parallelizable problems across large datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of the locality of data, processing it near the place it is stored in order to minimize communication overhead. A MapReduce framework (or system) is usually composed of three operations (or steps): 1. Map: each worker node applies the map function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of the redundant input data is processed. 2. Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node. 3. Reduce: worker nodes now process each group of output data, per key, in parallel. MapReduce allows for the distributed processing of the map and reduction operations. Maps can be performed in parallel, provided that each mapping operation is independent of the others; in practice, this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process often appears inefficient compared to algorithms that are more sequential (because multiple instances of the reduction process must be run), MapReduce can be applied to significantly larger datasets than a single "commodity" server can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours.[16] The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data are still available.
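As a purely illustrative aid, the self-contained Python sketch below mimics the three map / shuffle / reduce steps described above on a tiny in-memory word-count example; it involves no Hadoop, and the input lines are invented for demonstration.

```python
# Toy simulation of the map / shuffle / reduce steps on a word count.
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: each input record is turned into (key, value) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: values are grouped so that all pairs for one key land together.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: each key's group is processed independently (hence in parallel).
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # e.g. {'the': 3, 'quick': 2, 'dog': 2, ...}
```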
  • 69. Solr
  • 70. Kite (diagrams on slide: Without Kite / With Kite, Example Architecture) Kite is a high-level data layer for Hadoop. It is an API and a set of tools that speed up development. You configure how Kite stores your data in Hadoop, instead of building and maintaining that infrastructure yourself.
  • 71. YARN The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs. The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the ResourceManager/Scheduler. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
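One concrete way to see the ResourceManager acting as the cluster's central authority is through its REST API; the sketch below assumes a ResourceManager web endpoint at localhost:8088, and the host, port, and any security configuration are cluster-specific assumptions.

```python
# Hedged sketch: query the YARN ResourceManager REST API for cluster
# metrics and running applications (endpoint address is an assumption).
import requests

rm = "http://localhost:8088"

metrics = requests.get(f"{rm}/ws/v1/cluster/metrics").json()
print("Active NodeManagers:", metrics["clusterMetrics"]["activeNodes"])

apps = requests.get(f"{rm}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"])
```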
  • 72. Sentry Apache Sentry is a granular, role-based authorization module for Hadoop. Sentry provides the ability to control and enforce precise levels of privileges on data for authenticated users and applications on a Hadoop cluster. Sentry currently works out of the box with Apache Hive, Hive Metastore/HCatalog, Apache Solr, Impala and HDFS (limited to Hive table data). Sentry is designed to be a pluggable authorization engine for Hadoop components. It allows you to define authorization rules to validate a user or application’s access requests for Hadoop resources. Sentry is highly modular and can support authorization for a wide variety of data models in Hadoop.
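As a hedged illustration of what such authorization rules can look like when Sentry is paired with Hive, the snippet below issues role and grant statements over a HiveServer2 connection; the role, group, and database names are invented, the server address is an assumption, and the exact GRANT grammar can vary with the Sentry version.

```python
# Hedged sketch: Sentry-style, role-based grants issued as HiveQL.
# Role, group, database names, and the HiveServer2 address are assumptions.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="admin")
cursor = conn.cursor()

# Define a role, attach it to a group, and grant it read access.
cursor.execute("CREATE ROLE analyst_role")
cursor.execute("GRANT ROLE analyst_role TO GROUP analysts")
cursor.execute("GRANT SELECT ON DATABASE sales TO ROLE analyst_role")
```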
  • 74. HDFS The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject. The project URL is https://hadoop.apache.org/hdfs/.
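For a feel of programmatic access, here is a minimal sketch using the third-party Python hdfs (WebHDFS) client; the NameNode web address (typically port 9870 on Hadoop 3, 50070 on older releases), the user name, and the paths are illustrative assumptions.

```python
# Hedged sketch: read and write HDFS files through WebHDFS with the
# third-party "hdfs" package (endpoint, user, and paths are assumptions).
from hdfs import InsecureClient

client = InsecureClient("http://localhost:9870", user="hdfs")

# Write a small file, list its directory, then read it back.
client.write("/tmp/demo/hello.txt", data=b"hello hdfs", overwrite=True)
print(client.list("/tmp/demo"))
with client.read("/tmp/demo/hello.txt") as reader:
    print(reader.read())
```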
  • 75. Kudu Kudu is a columnar storage manager developed for the Apache Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operation. Table – A table is where your data is stored in Kudu. A table has a schema and a totally ordered primary key. A table is split into segments called tablets. Tablet – A tablet is a contiguous segment of a table, similar to a partition in other data storage engines or relational databases. A given tablet is replicated on multiple tablet servers, and at any given point in time, one of these replicas is considered the leader tablet. Any replica can service reads, and writes require consensus among the set of tablet servers serving the tablet. Tablet Server – A tablet server stores and serves tablets to clients. For a given tablet, one tablet server acts as a leader, and the others act as follower replicas of that tablet. Only leaders service write requests, while leaders or followers each service read requests. Leaders are elected using the Raft Consensus Algorithm. One tablet server can serve multiple tablets, and one tablet can be served by multiple tablet servers. Master – The master keeps track of all the tablets, tablet servers, the Catalog Table, and other metadata related to the cluster. At a given point in time, there can only be one acting master (the leader). If the current leader disappears, a new master is elected using the Raft Consensus Algorithm. The master also coordinates metadata operations for clients.
  • 76. HBase HBase is an open-source, distributed key-value storage system and column-oriented database with high write throughput and low-latency random read performance. By using HBase, we can perform online real-time analytics. The HBase architecture provides strong random-read performance. In HBase, data is sharded physically into what are known as regions. A single region server hosts each region, and each region server is responsible for one or more regions. The HBase architecture follows a master-slave model: an HBase cluster has one master node, called the HMaster, and several region servers, called HRegionServers. Each region server hosts multiple regions.
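The sketch below shows the kind of low-latency random reads and writes described above, using the third-party happybase client over HBase's Thrift gateway; the Thrift server address, the metrics table, its column family, and the row key are all illustrative assumptions.

```python
# Hedged sketch: random writes and reads against HBase via the Thrift
# gateway using happybase (server, table, and column names are assumptions).
import happybase

connection = happybase.Connection("localhost")
table = connection.table("metrics")

# Write qualified columns under a column family ("cf") for one row key.
table.put(b"sensor-1:2021-10-26", {b"cf:temp": b"21.5", b"cf:humidity": b"40"})

# Low-latency random read by row key.
row = table.row(b"sensor-1:2021-10-26")
print(row[b"cf:temp"])
```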
  • 77. Sqoop Sqoop is a tool that imports data from relational databases into HDFS and also exports data from HDFS back to relational databases. Sqoop can transfer bulk data efficiently between Hadoop and external data stores such as enterprise data warehouses and relational databases, and it imports data from external datastores into Hadoop ecosystem tools like Hive and HBase.
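A typical Sqoop import is driven from the command line; the hedged sketch below wraps one such invocation in Python, with the JDBC URL, credentials, table, and target directory all invented for illustration.

```python
# Hedged sketch: launch a Sqoop import of one relational table into HDFS.
# The JDBC URL, credentials, table, and target directory are assumptions.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",
    "--username", "report",
    "--password-file", "/user/report/.sqoop.pwd",
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
], check=True)
```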
  • 78. Flume Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.
  • 79. Kafka Apache Kafka® is a distributed streaming platform that: • Publishes and subscribes to streams of records, similar to a message queue or enterprise messaging system. • Stores streams of records in a fault-tolerant durable way. • Processes streams of records as they occur. Kafka is used for these broad classes of applications: • Building real-time streaming data pipelines that reliably get data between systems or applications. • Building real-time streaming applications that transform or react to the streams of data. Kafka is run as a cluster on one or more servers that can span multiple datacenters. The Kafka cluster stores streams of records in categories called topics. Each record consists of a key, a value, and a timestamp.
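A minimal produce-and-consume sketch with the third-party kafka-python client follows; the broker address, the events topic, and the record contents are illustrative assumptions.

```python
# Hedged sketch: publish to and read from a Kafka topic with kafka-python.
# Broker address, topic name, and payloads are assumptions.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", key=b"user-42", value=b'{"action": "login"}')
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s with no records
)
for record in consumer:
    print(record.key, record.value, record.timestamp)
```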
  • 82. Ambari The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. Ambari enables System Administrators to: • Provision a Hadoop Cluster ◦ Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts. ◦ Ambari handles configuration of Hadoop services for the cluster. • Manage a Hadoop Cluster ◦ Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster. • Monitor a Hadoop Cluster ◦ Ambari provides a dashboard for monitoring health and status of the Hadoop cluster. ◦ Ambari leverages the Ambari Metrics System for metrics collection. ◦ Ambari leverages the Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.). Ambari enables Application Developers and System Integrators to: • Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.
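A hedged sketch of reading cluster and service status through Ambari's REST API is shown below; the server address, the default admin credentials, and the cluster name "demo" are all illustrative assumptions.

```python
# Hedged sketch: list clusters and services via the Ambari REST API.
# Server URL, credentials, and the cluster name "demo" are assumptions.
import requests

ambari = "http://localhost:8080/api/v1"
auth = ("admin", "admin")

clusters = requests.get(f"{ambari}/clusters", auth=auth).json()
for item in clusters["items"]:
    print("Cluster:", item["Clusters"]["cluster_name"])

services = requests.get(f"{ambari}/clusters/demo/services", auth=auth).json()
for item in services["items"]:
    print("Service:", item["ServiceInfo"]["service_name"])
```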
  • 83. Avro Apache Avro™ is a data serialization system. Avro provides: • Rich data structures. • A compact, fast, binary data format. • A container file, to store persistent data. • Remote procedure call (RPC). • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages. Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects. • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages. • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size. • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
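To illustrate schema-carrying serialization, here is a small sketch using the third-party fastavro package (the official avro package offers equivalent read/write APIs); the schema and records are invented for demonstration.

```python
# Hedged sketch: write and read an Avro container file with fastavro.
# The User schema and records below are invented for illustration.
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "name": "User",
    "type": "record",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})
records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]

# The schema travels with the data in the container file.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

with open("users.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```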
  • 84. Cassandra Cassandra is a NoSQL distributed database. By design, NoSQL databases are lightweight, open-source, non-relational, and largely distributed. Counted among their strengths are horizontal scalability, distributed architectures, and a flexible approach to schema definition. NoSQL databases enable rapid, ad-hoc organization and analysis of extremely high-volume, disparate data types. That’s become more important in recent years, with the advent of Big Data and the need to rapidly scale databases in the cloud. Cassandra is among the NoSQL databases that have addressed the constraints of previous data management technologies, such as SQL databases.
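A minimal sketch with the DataStax cassandra-driver for Python follows; the contact point, keyspace, table, and single-replica settings are illustrative assumptions suited only to a local test node.

```python
# Hedged sketch: create a keyspace/table and read/write rows with the
# DataStax Python driver (contact point and names are assumptions).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")
session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "Ada"))

for row in session.execute("SELECT id, name FROM demo.users"):
    print(row.id, row.name)

cluster.shutdown()
```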
  • 85. Chukwa Apache Chukwa aims to provide a flexible and powerful platform for distributed data collection and rapid data processing. Our goal is to produce a system that's usable today, but that can be modified to take advantage of newer storage technologies (HDFS appends, HBase, etc.) as they mature. In order to maintain this flexibility, Apache Chukwa is structured as a pipeline of collection and processing stages, with clean and narrow interfaces between stages. This will facilitate future innovation without breaking existing code. Apache Chukwa has five primary components: • Adaptors that collect data from various data sources. • Agents that run on each machine and emit data. • ETL Processes for parsing and archiving the data. • Data Analytics Scripts for aggregating Hadoop cluster health. • HICC, the Hadoop Infrastructure Care Center, a web-portal-style interface for displaying data. (The slide includes a figure of the Apache Chukwa data pipeline, annotated with data dwell times at each stage.)
  • 87. Mahout for Machine Learning (diagrams on slide: Mahout Ecosystem, Mahout Algorithms)
  • 89. Oozie Hadoop is designed to handle large amounts of data from many sources, and to carry out often complicated work of various types against that data across the cluster. That's a lot of work, and the best way to get things done is to be organized with a schedule. That's what Apache Oozie does: it schedules the work (jobs) in Hadoop. Oozie enables users to combine multiple different Hadoop tasks, such as map/reduce tasks, Pig jobs, and Sqoop jobs for moving SQL data into Hadoop, into a single logical unit of work. This is managed via an Oozie Workflow, which is a Directed Acyclic Graph (DAG) of the tasks to be carried out. The DAG is stored in an XML process definition language called hPDL. An Oozie server is deployed as a Java web application hosted in a Tomcat server, and all of the stateful information, such as workflow definitions and jobs, is stored in a database. This database can be Apache Derby, HSQL, Oracle, MySQL, or PostgreSQL. There is an Oozie client, which submits work either via a CLI, an API, or a web service/REST call. The resulting architecture is shown on the slide.
  • 90. Ozone Ozone is a scalable, redundant, and distributed object store for Hadoop. Apart from scaling to billions of objects of varying sizes, Ozone can function effectively in containerized environments such as Kubernetes and YARN. Applications using frameworks like Apache Spark, YARN, and Hive work natively without any modifications. Ozone is built on a highly available, replicated block storage layer called Hadoop Distributed Data Store (HDDS). From https://blog.cloudera.com/introducing-apache-hadoop-ozone-object-store-apache-hadoop/: True to its big data roots, HDFS works best when most of the files are large – tens to hundreds of MBs. HDFS suffers from the famous small-files limitation and struggles with over 400 million files. There is an increased demand for an HDFS-like storage system that can scale to billions of small files. Ozone is a distributed key-value store that can manage both small and large files alike. While HDFS provides POSIX-like semantics, Ozone looks and behaves like an object store.
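Because Ozone behaves like an object store and ships an S3-compatible gateway, a generic S3 client can talk to it; the hedged boto3 sketch below assumes the gateway is reachable at localhost:9878 with placeholder credentials, both of which are deployment-specific assumptions.

```python
# Hedged sketch: use boto3 against Ozone's S3-compatible gateway.
# Endpoint URL, credentials, bucket, and key are assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9878",  # assumed Ozone S3 gateway address
    aws_access_key_id="any",
    aws_secret_access_key="any",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="logs/2021-10-26.txt", Body=b"hello ozone")
obj = s3.get_object(Bucket="demo-bucket", Key="logs/2021-10-26.txt")
print(obj["Body"].read())
```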
  • 91. Pig Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties: • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain. • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency. • Extensibility. Users can create their own functions to do special-purpose processing. From https://data-flair.training/blogs/hadoop-pig-tutorial/
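As a hedged illustration of Pig Latin's data-flow style, the sketch below writes the classic word-count script to a file and submits it with the pig command in local mode; the input and output paths are invented, and on a real cluster they would normally live in HDFS.

```python
# Hedged sketch: generate and run a classic Pig Latin word count locally.
# Input/output paths are assumptions; "pig -x local" runs without a cluster.
import subprocess

PIG_SCRIPT = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'word_counts';
"""

with open("wordcount.pig", "w") as f:
    f.write(PIG_SCRIPT)

subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```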
  • 92. Submarine Deep learning is useful for enterprise tasks in the fields of speech recognition, image classification, AI chatbots, and machine translation, just to name a few. In order to train deep learning/machine learning models, frameworks such as TensorFlow, MXNet, PyTorch, Caffe, and XGBoost can be leveraged, and sometimes these frameworks are used together to solve different problems. To make distributed deep learning/machine learning applications easy to launch, manage, and monitor, the Hadoop community initiated the Submarine project along with other improvements such as first-class GPU support, Docker container support, container-DNS support, scheduling improvements, etc. These improvements make distributed deep learning/machine learning applications run on Apache Hadoop YARN as simply as running them locally, which lets machine-learning engineers focus on algorithms instead of worrying about the underlying infrastructure. By upgrading to the latest Hadoop, users can now run deep learning workloads alongside other ETL/streaming jobs on the same cluster. This provides easy access to data on the same cluster and achieves better resource utilization. (Slide label: Zeppelin.)
  • 93. Tez The Apache TEZ® project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN. The 2 main design themes for Tez are: • Empowering end users by: ◦ Expressive dataflow definition APIs ◦ Flexible Input-Processor-Output runtime model ◦ Data type agnostic ◦ Simplifying deployment • Execution Performance ◦ Performance gains over Map Reduce ◦ Optimal resource management ◦ Plan reconfiguration at runtime ◦ Dynamic physical data flow decisions By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can process data that previously required multiple MR jobs in a single Tez job, as shown below.
  • 94. ZooKeeper Apache ZooKeeper is a distributed coordination service for managing a large set of hosts. Coordinating and managing services in a distributed environment is a complicated process; Apache ZooKeeper, with its simple architecture and API, solves this problem. ZooKeeper allows developers to focus on core application logic without worrying about the distributed nature of the application. The ZooKeeper framework provides a complete mechanism for overcoming the challenges faced by distributed applications. Apache ZooKeeper handles race conditions and deadlocks using a fail-safe synchronization approach, and it handles data inconsistency through atomicity. The various services provided by Apache ZooKeeper are as follows − • Naming service − This service identifies the nodes in the cluster by name. It is similar to DNS, but for nodes. • Configuration management − This service provides the latest and up-to-date configuration information of the system to a joining node. • Cluster management − This service tracks nodes joining or leaving the cluster and reports node status in real time. • Leader election − This service elects a node as a leader for coordination purposes. • Locking and synchronization service − This service locks the data while it is being modified. It helps in automatic failure recovery when connecting to other distributed applications such as Apache HBase. • Highly reliable data registry − It offers data availability even when one or a few nodes go down.
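The sketch below uses the third-party kazoo client to exercise two of the services listed above, configuration storage and leader election; the ensemble address, znode paths, and worker identifier are illustrative assumptions.

```python
# Hedged sketch: configuration znodes and leader election via kazoo.
# Ensemble address, paths, and identifier are assumptions.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Configuration management: store and read a small piece of config data.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_x=on")
data, stat = zk.get("/app/config")
print(data, stat.version)

# Leader election: blocks until this client wins, then runs the callback.
election = zk.Election("/app/election", identifier="worker-1")
election.run(lambda: print("elected leader"))

zk.stop()
```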
  • 95. NIST Big Data Reference Architecture
  • 96. NIST Big Data Architecture