Data Con LA 2022 - Pre-Recorded - OpenSearch: Everything You Need to Know About Its Architecture

OpenSearch
(Just About) Everything You Need to
Know About its Architecture
Seth Muthukaruppan
Consultant, Search Technologies
Instaclustr By NetApp
© Instaclustr Pty Limited, 2021
Data Con LA 2022

Agenda
● OpenSearch Overview
● Use Cases
● Apache Lucene
● OpenSearch Clustering
● Data Organization
● Document Indexing
● Document Searching
● Aggregations

● OpenSearch is a search and analytics engine built with the Apache
Lucene search library
● Extends Lucene to provide a distributed, horizontally scalable, and
highly available search and analytics platform
● OpenSearch is derived from Elasticsearch 7.10.2 and Kibana
7.10.2 from Elastic Co
● OpenSearch is 100% open source and Apache 2.0 licensed - free
to view, use, change, and distribute the code
● Community driven and maintained by the open-source community,
with backing from industry leaders such as Amazon and Red Hat

● Enterprise-Grade: same core features with advanced add-ons
● 100% Open Source: Apache 2.0; free to view, use, change and distribute the code
● Community-Driven: developed and maintained by the open source community

OpenSearch: Core Features
Search
● Extremely fast
● Powerful text search
● Natural language
● Built-in analyzers
● Fuzzy match
● Auto completion
Scalable
● Distributed architecture
● Horizontally scalable
● Thousands of nodes
● Petabytes of data
Analytics
● Faceting
● Aggregations
● Built-in reporting
● Anomaly detection
Highly Available
● Data replication
● Zone awareness
● Snapshots
● Cross-cluster replication
Ecosystem
● Dashboards
● Logstash
● Beats
● REST clients

OpenSearch: Building Blocks
OpenSearch
● Based on Elasticsearch 7.10.2
● Built with Apache Lucene
● Elasticsearch wire compatible (7.10.2)
OpenSearch Dashboards
● Based on Kibana 7.10.2
Clients
● Compatible with Logstash, Beats, and REST clients for ES 7.10
Upgrade Path
● Rolling upgrade from ES 7.x (7.10)
● Restart upgrade from ES 6.x

OpenSearch: Use Cases

OpenSearch: Use Cases
Log Analysis
● Search for patterns
● Normalize logs
● Correlate logs
● Filter logs
Document Store
● Natural language search
● Search with auto-correction
● Search as you type
● Synonym search
Analytics
● Historical data
● Trend analysis
● Forecasting
e-Commerce
● Product search
● Product recommendations
● Auto completion
● Stock on hand
● Sales by category

OpenSearch: More Use Cases
Monitoring
● Network
● Hosts
● Sensors
SIEM
● Threat analysis
● Integrity monitoring
● Anomaly detection
● Compliance
APM
● Real-time performance
● Latency, load
● Failures
Time Series
● Machine learning
● Anomaly detection

OpenSearch: Apache Lucene

● Lucene is an open source, high-performance search library written in Java
● Used by popular search projects such as Apache Solr, Apache
Nutch, OpenSearch, and Elasticsearch
● Lucene uses an inverted search index to achieve incredibly fast search
results
● The inverted search index provides a mapping of terms to documents that
contain those terms
● Lucene supports storing several types of information such as numbers,
strings, and text fields
● Lucene has a rich search interface with support for natural language
searches, wildcard searches, fuzzy, and proximity searches
Apache Lucene: Overview

Apache Lucene: Inverted Index
Documents
1: OpenSearch is a search and analytics suite
2: OpenSearch is ALv2 Licensed
3: OpenSearch includes OpenSearch and OpenSearch Dashboards
Inverted index
Term        Frequency  Document
opensearch  1,1,3      1,2,3
search      1          1
analytics   1          1
suite       1          1
alv         1          2
licensed    1          2
includes    1          3
dashboards  1          3
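As a rough illustration of how such an index could be built, the sketch below maps each term to the documents that contain it, along with a per-document term frequency. The toy analyzer (lowercasing, alphabetic tokens only, a small stop-word list) is an assumption chosen to mirror the terms on the slide; it is not Lucene's actual analysis chain.

```python
import re
from collections import defaultdict

# The three example documents from the slide, keyed by document id.
docs = {
    1: "OpenSearch is a search and analytics suite",
    2: "OpenSearch is ALv2 Licensed",
    3: "OpenSearch includes OpenSearch and OpenSearch Dashboards",
}

# Assumption: a toy analyzer that lowercases, keeps alphabetic runs only,
# and drops a few stop words so the terms resemble the slide's table.
STOP_WORDS = {"is", "a", "and"}


def analyze(text):
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_inverted_index(documents):
    # term -> {doc_id: number of times the term occurs in that document}
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Looking up a term is then a single dictionary access, which is the property that makes term lookups fast regardless of how many documents are stored.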

OpenSearch: Cluster

● Lucene is a search library but not a scalable search engine
● OpenSearch uses Lucene at the core for search but has additional
capabilities that make it a full-featured search and analytics engine
● An OpenSearch cluster is a distributed collection of nodes that each perform
one or more cluster operations
● The cluster is horizontally scalable - adding additional nodes allows the
cluster capacity to increase linearly while maintaining similar performance
● By replicating data across nodes in the cluster,
OpenSearch can handle node failures with no data loss or downtime
● Nodes in the cluster are differentiated based on the specific functions that
they perform although a node can perform any or all cluster operations
OpenSearch Cluster: Basics

OpenSearch Cluster: Composition
[Diagram: one elected master node, two other master-eligible nodes, three data nodes, and two client nodes]

● Master
○ Responsible for maintaining the health and state of the cluster
○ Coordinator for creating, deleting, managing indices and shards
● Master-eligible nodes
○ Candidate master nodes; only one is the elected master at any given time
○ An odd number of master-eligible nodes is recommended for tie-breaking
● Data nodes
○ Hold the actual data and handle ingestion, search, and aggregation
○ Run CPU and memory-intensive operations
● Client nodes
○ Act as a gateway and help load balance incoming requests
OpenSearch Cluster: Node Types
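The tie-breaking point can be made concrete with a little arithmetic. The sketch below assumes simple majority voting among master-eligible nodes; it illustrates the reasoning only and is not the cluster's actual election code.

```python
def quorum(master_eligible: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can fail while a majority remains."""
    return master_eligible - quorum(master_eligible)


for n in (2, 3, 4, 5):
    print(f"{n} nodes: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Under this assumption, going from three to four master-eligible nodes raises the quorum without tolerating any additional failure, which is why odd counts are usually preferred.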

OpenSearch: Data Organization

Data Organization: Indices
● An Index is the basic unit by which end users manage their data
○ Similar to a collection in a NoSQL database
● Indices contain one or more documents which can be
○ a paragraph from a book
○ a log line
○ a tweet
○ weather data for a city
● Typically similar documents are grouped into the same index
● Indices are internally broken down into multiple sub-indices called shards
● Shards are then directly mapped to Lucene indices

Data Organization: Documents
● Documents are JSON structures that hold a collection of fields and values
○ Fields are key-value pairs that make up a document
○ Fields can be of several different types such as numbers, text,
keywords, geo points, etc
● Documents in the same index typically have similar content

Data Organization: Shards
● An index is broken up into one or more smaller units called shards
● Each shard maps to an underlying Lucene index
○ In other words, each OpenSearch index is mapped to one or more Lucene
indices (its shards)
● The number of shards per index is a configurable parameter and has major
implications on the performance of the cluster
● Search operations are performed at the shard level, and having multiple
shards helps increase search speed
● Increasing the number of shards also increases the cluster state information,
which means more resources are needed to manage it
○ General practice is to keep the shard size between 30 and 50 GB
Data Organization: Primary/Replica Shards
● To guard against data loss, OpenSearch allows configuring replicas
● As the index is stored in shards, configuring replicas causes replica shards to
be created and stored
● OpenSearch tries to allocate replica shards to nodes other than the ones
where the primary shard resides
● Number of replicas is an index-level setting and can be changed at any time
● With replicas, a node failure doesn’t lead to data loss or data unavailability
○ Data can still be served from the replica copies
● Replica shards come with a price
○ Storing replicas requires additional storage space
○ Can slow down indexing as data needs to be indexed into both primary
and replica shards
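To make the replica trade-off concrete, the sketch below computes the shard and storage overhead for a given replica count and shows a toy placement that keeps copies of the same shard on different nodes. `number_of_shards` and `number_of_replicas` are the index settings involved; the round-robin `place_copies` function is a hypothetical stand-in, not OpenSearch's real shard allocator.

```python
# Index-level settings; number_of_replicas can be changed on a live index.
settings = {"index": {"number_of_shards": 5, "number_of_replicas": 1}}


def total_shards(primaries: int, replicas: int) -> int:
    """Every primary shard has `replicas` additional copies."""
    return primaries * (1 + replicas)


def storage_gb(primary_data_gb: float, replicas: int) -> float:
    """Disk needed once every copy of the data is stored."""
    return primary_data_gb * (1 + replicas)


def place_copies(primaries, replicas, nodes):
    """Toy round-robin placement: no two copies of a shard share a node."""
    if replicas + 1 > len(nodes):
        raise ValueError("not enough nodes to separate all copies")
    placement = {}
    for shard in range(primaries):
        for copy in range(replicas + 1):  # copy 0 is the primary
            placement[(shard, copy)] = nodes[(shard + copy) % len(nodes)]
    return placement


idx = settings["index"]
print(total_shards(idx["number_of_shards"], idx["number_of_replicas"]))
print(storage_gb(200, idx["number_of_replicas"]))
print(place_copies(3, 1, ["node-a", "node-b", "node-c"]))
```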

OpenSearch: Document Indexing

● Input data to Elasticsearch/OpenSearch is analyzed and tokenized
before it gets stored
○ OpenSearch also stores the original document in a special field
called _source
● Analyzers and normalizers convert the input fields into a sequence of
terms, which are then stored in the Lucene inverted index
● Pre-built analyzers and normalizers are available for common use cases
○ The standard analyzer breaks text into grammar-based tokens
○ The whitespace analyzer breaks text into terms at whitespace
Document Indexing: Basics

● An analyzer is a combination of character filters, tokenizers, and token
filters
● Custom analyzers can be built using the appropriate set of filters and
tokenizers.
Document Indexing: Analyzers
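A custom analyzer is declared in the index settings. The sketch below builds such a settings body as a Python dictionary; the analyzer name `my_html_analyzer` is a hypothetical choice, while `html_strip`, `standard`, `lowercase`, and `asciifolding` are built-in components.

```python
import json

# "my_html_analyzer" is a made-up name; the components it combines are built-ins.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_html_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],            # character filters run first
                    "tokenizer": "standard",                  # exactly one tokenizer
                    "filter": ["lowercase", "asciifolding"],  # token filters run in order
                }
            }
        }
    }
}

# This JSON would be sent as the body of an index-creation request.
print(json.dumps(index_settings, indent=2))
```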

● Character filters pre-process the input text before forwarding it to the
tokenizer
● They work by adding, removing, or changing characters in the input text
● For example, the built-in HTML strip character filter strips HTML
elements and decodes HTML entities
● Multiple character filters can be specified and they will be applied in
order
Document Indexing: Character Filters

● Tokenizers convert the input stream of characters into tokens based on
certain criteria
○ For instance, the standard tokenizer breaks text into tokens based
on word boundaries and also removes punctuation
○ The whitespace tokenizer breaks text into tokens at whitespace
● Token filters post-process the tokens from the tokenizer
○ Tokens can be added, removed, or modified
○ For example, the ASCII folding filter will convert Unicode characters
to the closest ASCII equivalent
○ The stemming token filter applies stemming rules to convert words
to their root form
Document Indexing: Tokenizers and Token Filters
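The three analysis stages can be imitated in a few lines of plain Python. This is a simplified simulation under stated assumptions: the regular expressions and Unicode folding below only approximate what the built-in HTML strip character filter, standard tokenizer, and ASCII folding filter do.

```python
import html
import re
import unicodedata


def html_strip(text):
    """Character filter: drop tags, then decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def word_tokenizer(text):
    """Tokenizer: split on non-word characters, which also drops punctuation."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def ascii_fold(tokens):
    """Token filter: strip accents so characters fall back to ASCII letters."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def analyze(text, char_filters, tokenizer, token_filters):
    for char_filter in char_filters:      # applied in order
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:    # applied in order
        tokens = token_filter(tokens)
    return tokens


terms = analyze(
    "<p>Caf&eacute; M&uuml;nchen, open late!</p>",
    [html_strip],
    word_tokenizer,
    [lowercase, ascii_fold],
)
print(terms)
```

The resulting terms are what would be written to the inverted index for that field.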

● Text Type
○ Primarily used to index human-generated text such as tweets, social
media posts, book contents, product descriptions
○ Text fields are particularly useful for performing phrase queries,
fuzzy queries, etc
● Keyword Type
○ Typically used for indexing structured content such as names, ids,
ISBN, categories, etc
○ Keyword fields are particularly useful for sorting, aggregations and
running scripts as they are stored in a columnar format
Document Indexing: Field Data Types

● Numeric Type
○ Used for numeric data such as integers, unsigned integers, floats
○ When choosing a numeric type, the smallest type that could fit the
input range should be chosen to conserve storage space
○ Numeric fields are stored as BKD trees
● Geo Point Type
○ Geo point is used to represent latitude and longitude data
○ With geo points, location-based and distance-based queries can be
performed
○ Geo points are stored in BKD trees
Document Indexing: More Field Data Types
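As an illustration of the field types above, the sketch below shows a mapping that uses each of them, plus a hypothetical helper that picks the smallest signed integer type for a value range. The field names are invented for the example; `text`, `keyword`, `byte`, `short`, `integer`, `long`, and `geo_point` are the type names used in mappings.

```python
# Hypothetical fields for a bookshop index.
mapping = {
    "mappings": {
        "properties": {
            "description": {"type": "text"},          # analyzed, for full-text search
            "isbn": {"type": "keyword"},              # exact match, sorting, aggregations
            "page_count": {"type": "short"},          # smallest type that fits the range
            "store_location": {"type": "geo_point"},  # latitude / longitude
        }
    }
}

# Signed integer types and their widths in bits.
INTEGER_TYPES = [("byte", 8), ("short", 16), ("integer", 32), ("long", 64)]


def smallest_integer_type(lowest: int, highest: int) -> str:
    """Pick the narrowest signed integer type that covers [lowest, highest]."""
    for name, bits in INTEGER_TYPES:
        if -(2 ** (bits - 1)) <= lowest and highest <= 2 ** (bits - 1) - 1:
            return name
    raise ValueError("range does not fit in a signed 64-bit integer")


print(smallest_integer_type(0, 5000))
```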

OpenSearch: Document Searching

● OpenSearch uses a distributed search algorithm to match documents
against the input
● Search can be exact match based such as keyword searches or
relevancy based such as text searches
● OpenSearch focuses more on search speed than accuracy. The level of
required accuracy is typically configurable
● OpenSearch provides near real-time search: all documents
are available for search except the most recently indexed
documents that have not yet been refreshed. By default, an index is
refreshed every second
Document Searching: Basics
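The near real-time behavior can be pictured with a toy model in which newly indexed documents sit in a buffer until a refresh makes them searchable. The `ToyShard` class is hypothetical and greatly simplified; in a real cluster the interval is controlled by the `index.refresh_interval` setting.

```python
class ToyShard:
    """Toy model: documents become searchable only after a refresh."""

    def __init__(self):
        self._buffer = []      # indexed but not yet visible to search
        self._searchable = []  # visible to search

    def index(self, doc: str) -> None:
        self._buffer.append(doc)

    def refresh(self) -> None:
        # Move everything indexed since the last refresh into the searchable set.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, word: str):
        return [d for d in self._searchable if word in d.lower().split()]


shard = ToyShard()
shard.index("OpenSearch is fast")
before = shard.search("opensearch")  # nothing yet: no refresh has happened
shard.refresh()
after = shard.search("opensearch")   # the document is now visible
print(before, after)
```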

● OpenSearch's distributed search algorithm uses a query phase and a fetch phase
● Query phase
○ The query is sent to one copy (primary or replica) of every shard
of the index
○ Each shard will run the search locally and return the results.
○ Results only contain the document ids, scores, and other relevant
metadata but not the actual document
● Fetch Phase
○ Query results from all the shards are merged and ordered to form the
final set of results
○ A fetch is performed to retrieve the actual documents from the nodes
Document Searching: Search Phases
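The two phases can be sketched as a scatter-gather over in-memory shards. Everything here is a simplified assumption: the shard contents and precomputed scores are invented, and a real cluster performs these steps across the network.

```python
import heapq

# Hypothetical shards: doc_id -> (precomputed score, stored document).
shards = {
    "shard0": {"d1": (1.2, {"id": "d1"}), "d2": (0.4, {"id": "d2"})},
    "shard1": {"d3": (2.5, {"id": "d3"}), "d4": (0.9, {"id": "d4"})},
    "shard2": {"d5": (0.1, {"id": "d5"})},
}


def query_phase(shard, size):
    """Each shard returns only (score, doc_id) pairs for its local top hits."""
    hits = [(score, doc_id) for doc_id, (score, _doc) in shard.items()]
    return heapq.nlargest(size, hits)


def search(all_shards, size):
    # Query phase: ask every shard for its lightweight top hits.
    candidates = []
    for name, shard in all_shards.items():
        for score, doc_id in query_phase(shard, size):
            candidates.append((score, name, doc_id))
    # Merge step: order the per-shard results globally.
    top = heapq.nlargest(size, candidates)
    # Fetch phase: retrieve full documents only for the final hits.
    return [(score, all_shards[name][doc_id][1]) for score, name, doc_id in top]


results = search(shards, size=3)
print(results)
```

Returning only ids and scores in the first phase keeps the data moved between nodes small, since full documents are fetched for the final hits alone.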

● Relevancy searching, such as searching for words against text fields,
involves scoring to determine which documents are the closest match
● Document scoring is achieved by the similarity module. The default
similarity is BM25
Document Searching: Scoring Algorithm
Score = TF * IDF * Norm
TF: How frequently the given term appears in the field. The higher the number of times the term appears in the
document, the more likely the document is to be relevant.
IDF: How frequently the term appears across all documents in the index. If it appears more commonly, then
the term is less relevant
Norm: Length normalization. For the same term frequency, a shorter field is more relevant than a longer field.
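For a single term, the three ingredients above combine as in the textbook BM25 formula sketched below. The parameter values k1 = 1.2 and b = 0.75 are the commonly used defaults; the function name and arguments are assumptions for illustration, and Lucene's implementation may differ in details such as how field lengths are stored.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document."""
    # IDF: rarer terms (small docs_with_term) get a larger weight.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Norm: fields longer than average are penalized, shorter ones rewarded.
    norm = 1 - b + b * doc_len / avg_doc_len
    # TF: grows with term frequency but saturates, controlled by k1.
    return idf * tf * (k1 + 1) / (tf + k1 * norm)


base = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, doc_count=1000, docs_with_term=10)
more_tf = bm25_term_score(tf=2, doc_len=100, avg_doc_len=100, doc_count=1000, docs_with_term=10)
shorter = bm25_term_score(tf=1, doc_len=50, avg_doc_len=100, doc_count=1000, docs_with_term=10)
common = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, doc_count=1000, docs_with_term=500)
print(base, more_tf, shorter, common)
```

The score of a multi-term query is the sum of the per-term scores.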

OpenSearch: Aggregations

● OpenSearch is not only a search engine but also has built-in,
advanced analytics capabilities
● Aggregations can filter and categorize documents, calculate
metrics, and build aggregation pipelines by combining multiple
aggregations
● OpenSearch has support for many aggregation types which can be
classified as
○ Metrics Aggregations
○ Bucket Aggregations
○ Pipeline Aggregations
Aggregations: Basics

● Bucket aggregations categorize the matching set of documents into
buckets based on a bucketing criterion
● The bucketing criterion could be unique values of a field (terms
aggregation), date ranges (date histogram aggregation), etc.
● Bucket aggregations can be used to
○ paginate all buckets (composite aggregation)
○ provide faceting
○ act as inputs to metric aggregations
Aggregations: Bucket Aggregations
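A terms aggregation can be pictured as counting documents per unique field value. The sketch below shows a request body alongside a local Python equivalent; the sample documents, the field name `category`, and the aggregation name `by_category` are invented for illustration.

```python
from collections import Counter

# Request body for a terms aggregation ("by_category" is a made-up name).
request = {"size": 0, "aggs": {"by_category": {"terms": {"field": "category"}}}}

# Hypothetical documents matched by the query.
docs = [
    {"category": "books"}, {"category": "toys"}, {"category": "books"},
    {"category": "games"}, {"category": "books"}, {"category": "toys"},
]


def terms_aggregation(documents, field):
    """Group documents by the unique values of a field, largest bucket first."""
    counts = Counter(d[field] for d in documents if field in d)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common()]


buckets = terms_aggregation(docs, "category")
print(buckets)
```

These per-value counts are exactly what a faceted navigation sidebar displays.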

● Metrics aggregations calculate metrics on the values generated from the
documents. The values could be specific fields in the documents being
aggregated or generated dynamically through scripts
● They can be included as sub-aggregations to bucket aggregations and
will produce metrics per aggregated bucket
● Numeric metric aggregations produce numeric metrics such as max,
min, sum, and average values
● Some metric aggregations produce non-numeric output. A
good example is the top hits aggregation. Used as a sub-aggregation, it
produces the top matching documents per bucket.
Aggregations: Metric Aggregations
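Nesting a metric aggregation under a bucket aggregation yields one metric per bucket. The sketch below pairs a request body with a local Python equivalent; the documents, field names, and aggregation names are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# An avg metric nested under a terms bucket aggregation (names are made up).
request = {
    "size": 0,
    "aggs": {
        "by_category": {
            "terms": {"field": "category"},
            "aggs": {"avg_price": {"avg": {"field": "price"}}},
        }
    },
}

# Hypothetical documents matched by the query.
docs = [
    {"category": "books", "price": 10.0},
    {"category": "books", "price": 20.0},
    {"category": "toys", "price": 5.0},
]


def metrics_per_bucket(documents, bucket_field, metric_field):
    """Compute numeric metrics separately for each bucket."""
    values = defaultdict(list)
    for d in documents:
        values[d[bucket_field]].append(d[metric_field])
    return {
        key: {"doc_count": len(v), "avg": mean(v), "min": min(v), "max": max(v)}
        for key, v in values.items()
    }


result = metrics_per_bucket(docs, "category", "price")
print(result)
```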

● Pipeline aggregations can be used to compute metrics and they act on
the output of other aggregations making it possible to build a chain of
aggregations
● Pipeline aggregations can be further categorized as parent and sibling
pipeline aggregations
○ Parent pipeline aggregations compute new aggregations based on
the output of the parent aggregation
○ Sibling pipeline aggregations compute new aggregations based on
output from one or more sibling aggregations
Aggregations: Pipeline Aggregations
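The difference between parent and sibling pipelines is easiest to see on a small example. The request body below nests a `derivative` (parent pipeline) inside a date histogram and places a `max_bucket` (sibling pipeline) next to it; the Python functions imitate both locally. The monthly figures, field names, and aggregation names are invented for illustration.

```python
# Request sketch: "per_month", "sales", "sales_change", "best_month" are made-up names.
request = {
    "size": 0,
    "aggs": {
        "per_month": {
            "date_histogram": {"field": "date", "calendar_interval": "month"},
            "aggs": {
                "sales": {"sum": {"field": "amount"}},
                # Parent pipeline: works on the buckets of its parent histogram.
                "sales_change": {"derivative": {"buckets_path": "sales"}},
            },
        },
        # Sibling pipeline: works on the output of the neighboring aggregation.
        "best_month": {"max_bucket": {"buckets_path": "per_month>sales"}},
    },
}

# Hypothetical output of the date histogram with its sum sub-aggregation.
monthly_sales = [("2022-01", 100.0), ("2022-02", 140.0), ("2022-03", 120.0)]


def derivative(buckets):
    """Parent pipeline: change between each bucket and the previous one."""
    out, previous = [], None
    for key, value in buckets:
        out.append((key, None if previous is None else value - previous))
        previous = value
    return out


def max_bucket(buckets):
    """Sibling pipeline: the bucket with the largest value."""
    return max(buckets, key=lambda bucket: bucket[1])


print(derivative(monthly_sales))
print(max_bucket(monthly_sales))
```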

OpenSearch: Ecosystem

Seth Muthukaruppan
Seth.Muthukaruppan@netapp.com
linkedin.com/Seth.Muthukaruppan
Questions & Comments
