Seth Muthukaruppan, Consultant at Instaclustr
Data Engineering
OpenSearch is a powerful, fully open-source search engine and analytics suite for ingesting, searching, visualizing, and analyzing data. This Apache 2.0-licensed, community-driven collection of technologies builds on an architecture that combines Elasticsearch 7.10.2, Kibana 7.10.2, and Apache Lucene. With OpenSearch, users gain a distributed framework that offers horizontal scalability, high availability, and database-like capabilities. Attendees at this DataCon LA presentation will come away understanding OpenSearch's architecture and its building-block technology components, including:
-- Apache Lucene utilization. Learn how OpenSearch uses this high-performance Java-based search library, and how Lucene's inverted search index delivers fast search results (while supporting natural language, wildcard, fuzzy, and proximity searches).
-- OpenSearch cluster architecture. An OpenSearch cluster is a distributed and horizontally scalable collection of nodes, which are differentiated based on the operations they perform. Attendees will learn the specific functions of master, master-eligible, data, client, and ingest nodes.
-- Data organization. Understand how OpenSearch organizes data into indices (which contain documents, which contain fields).
-- Internal data structures. Get an in-depth look at how OpenSearch achieves scalability and reliability by breaking up indices into shards and segments, and by using translogs.
-- Aggregations. See how OpenSearch provides its advanced built-in analytics capabilities through aggregations.
3. ● OpenSearch is a search and analytics engine built with the Apache
Lucene search library
● Extends Lucene to provide a distributed, horizontally scalable, and
highly available search and analytics platform
● OpenSearch is derived from Elasticsearch 7.10.2 and Kibana
7.10.2 from Elastic Co
● OpenSearch is 100% open-source and Apache 2.0 licensed - Free
to view, use, change and distribute the code
● Community-driven and maintained by the open-source community,
with backing from industry leaders such as Amazon and Red Hat
11. Apache Lucene: Overview
● Lucene is an open source, high-performance search library built with Java
● Used by popular search engines such as Apache Solr, Apache Nutch, OpenSearch, and Elasticsearch
● Lucene uses an inverted search index to achieve very fast search results
● The inverted search index provides a mapping of terms to the documents that contain those terms
● Lucene supports storing several types of information such as numbers, strings, and text fields
● Lucene has a rich search interface with support for natural language, wildcard, fuzzy, and proximity searches
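The term-to-document mapping described above can be sketched in a few lines of Python. This is only a toy illustration: the sample documents, the whitespace splitting, and the function names `build_inverted_index` and `search` are assumptions made for the example, and Lucene's real index structures are far more elaborate.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return ids of documents that contain every term in the query."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}
index = build_inverted_index(docs)

print(index["quick"])                 # [1, 3]
print(search(index, "quick brown"))   # [1, 3]
print(search(index, "the dog"))       # [2]
```

A lookup touches only the postings for the query terms, which is why searches stay fast even when the number of documents grows.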
14. OpenSearch Cluster: Basics
● Lucene is a search library, not a scalable search engine
● OpenSearch uses Lucene at the core for search but has additional capabilities that make it a full-featured search and analytics engine
● An OpenSearch cluster is a distributed collection of nodes that each perform one or more cluster operations
● The cluster is horizontally scalable: adding nodes allows the cluster capacity to increase linearly while maintaining similar performance
● By replicating data across nodes in the cluster, OpenSearch can handle node failures with no data loss or downtime
● Nodes in the cluster are differentiated based on the specific functions that they perform, although a node can perform any or all cluster operations
16. OpenSearch Cluster: Node Types
● Master
○ Responsible for maintaining the health and state of the cluster
○ Coordinates creating, deleting, and managing indices and shards
● Master-eligible nodes
○ Candidates for the master role; only one master is active at any given time
○ An odd number of master-eligible nodes is recommended so that elections cannot tie
● Data nodes
○ Hold the actual data and handle ingestion, search, and aggregation
○ Run CPU- and memory-intensive operations
● Client nodes
○ Act as a gateway and help load balance incoming requests
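The tie-breaking point for master-eligible nodes can be illustrated with simple majority arithmetic. The sketch below assumes a strict-majority voting rule; the helper names `quorum` and `tolerated_failures` are hypothetical, and OpenSearch's actual election logic is more involved than this model.

```python
def quorum(master_eligible):
    """Smallest number of votes that forms a strict majority."""
    return master_eligible // 2 + 1

def tolerated_failures(master_eligible):
    """How many master-eligible nodes can fail while a majority remains."""
    return master_eligible - quorum(master_eligible)

for n in (2, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# 2 nodes: quorum 2, tolerates 0 failures
# 3 nodes: quorum 2, tolerates 1 failure
# 4 nodes: quorum 3, tolerates 1 failure
# 5 nodes: quorum 3, tolerates 2 failures
```

Under this model, going from three to four master-eligible nodes adds cost without adding fault tolerance, which is the usual argument for odd node counts.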
19. Data Organization: Indices
● An index is the basic unit by which end users manage their data
○ Similar to a collection in a NoSQL database
● Indices contain one or more documents which can be
○ a paragraph from a book
○ a log line
○ a tweet
○ weather data for a city
● Typically similar documents are grouped into the same index
● Indices are internally broken down into multiple sub-indices called shards
● Shards are then directly mapped to Lucene indices
20. Data Organization: Documents
● Documents are JSON structures that hold a collection of fields and values
○ Fields are key-value pairs that make up a document
○ Fields can be of several different types such as numbers, text,
keywords, geo points, etc
● Typically, documents in the same index have similar content
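As an illustration of a document as a JSON structure of fields and values, here is a hypothetical weather reading built in Python. All field names and values are invented for the example.

```python
import json

# Hypothetical weather document: each key-value pair is a field.
document = {
    "city": "Los Angeles",                       # short structured string
    "observed_at": "2022-08-13T10:00:00Z",       # timestamp as ISO 8601 text
    "temperature_c": 27.5,                       # number
    "conditions": "Sunny with light winds",      # free text
    "location": {"lat": 34.05, "lon": -118.24},  # latitude/longitude pair
}

body = json.dumps(document, indent=2)
print(body)
```

The serialized `body` is the kind of payload that would be sent to the cluster when indexing the document.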
21. Data Organization: Shards
● An index is broken up into one or more smaller units called shards
● Each shard maps to an underlying Lucene index
○ In other words, each OpenSearch index is mapped to one or more Lucene indices (its shards)
● The number of shards per index is a configurable parameter and has major implications for the performance of the cluster
● Search operations are performed at the shard level, and having multiple shards helps increase search speed
● Increasing the number of shards increases the cluster state information, which means more resources are needed to manage it
○ General practice is to keep the shard size between 30 and 50 GB
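The sizing guideline above translates into simple arithmetic. The following sketch assumes a target shard size of 40 GB (a hypothetical midpoint of the 30 to 50 GB range) and a helper named `primary_shard_count`; real sizing also depends on query load, growth, and hardware.

```python
import math

def primary_shard_count(index_size_gb, target_shard_gb=40.0):
    """Number of primary shards needed so that no shard exceeds the target size."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))

print(primary_shard_count(500))       # 500 / 40 = 12.5, rounded up to 13
print(primary_shard_count(500, 50))   # 10
print(primary_shard_count(20))        # 1
```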
22. Data Organization: Primary/Replica Shards
● To guard against data loss, OpenSearch allows configuring replicas
● As the index is stored in shards, configuring replicas causes replica shards to
be created and stored
● OpenSearch tries to allocate replica shards to nodes other than the ones
where the primary shard resides
● Number of replicas is an index-level setting and can be changed at any time
● With replicas, a node failure doesn’t lead to data loss or data unavailability
○ Data can still be served from the replica copies
● Replica shards come with a price
○ Storing replicas requires additional storage space
○ Can slow down indexing as data needs to be indexed into both primary
and replica shards
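The storage cost of replicas can be estimated with a short calculation. The helpers `total_storage_gb` and `total_shards` are hypothetical names for this sketch, and the figures ignore compression and overhead. The `replica_settings` dictionary shows the shape of an index settings update that changes the replica count.

```python
def total_storage_gb(primary_size_gb, replicas):
    """Storage for all copies: one primary plus `replicas` replica copies."""
    return primary_size_gb * (1 + replicas)

def total_shards(primary_shards, replicas):
    """Each primary shard gets `replicas` additional replica shards."""
    return primary_shards * (1 + replicas)

# Request body for updating an existing index's settings.
replica_settings = {"index": {"number_of_replicas": 2}}

replicas = replica_settings["index"]["number_of_replicas"]
print(total_storage_gb(100, replicas))  # 300
print(total_shards(5, replicas))        # 15
```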
24. Document Indexing: Basics
● Input data to Elasticsearch/OpenSearch is analyzed and tokenized before it gets stored
○ OpenSearch also stores the original document in a special field called _source
● Analyzers and normalizers convert the input fields into a sequence of terms, which then get stored in the Lucene inverted index
● Pre-built analyzers and normalizers are available for common use cases
○ The standard analyzer breaks text into grammar-based tokens
○ The whitespace analyzer breaks text into terms based on whitespace
25. Document Indexing: Analyzers
● An analyzer is a combination of character filters, tokenizers, and token filters
● Custom analyzers can be built using the appropriate set of filters and tokenizers
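A custom analyzer is declared in the index settings by naming its character filters, tokenizer, and token filters. The sketch below builds such a request body as a Python dictionary. The analyzer name `my_html_analyzer` is hypothetical; `html_strip`, `standard`, `lowercase`, and `asciifolding` are built-in components.

```python
import json

# Index-creation body that defines one custom analyzer.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_html_analyzer": {               # hypothetical analyzer name
                    "type": "custom",
                    "char_filter": ["html_strip"],  # applied first, in order
                    "tokenizer": "standard",        # exactly one tokenizer
                    "filter": ["lowercase", "asciifolding"],  # token filters, in order
                }
            }
        }
    }
}

print(json.dumps(index_body, indent=2))
```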
26. Document Indexing: Character Filters
● Character filters pre-process the input text before forwarding it to the tokenizer
● They work by adding, removing, or changing characters in the input text
● For example, the built-in HTML strip character filter strips HTML elements and decodes HTML entities
● Multiple character filters can be specified, and they are applied in order
27. Document Indexing: Tokenizers and Token Filters
● Tokenizers convert the input stream of characters into tokens based on certain criteria
○ For instance, the standard tokenizer breaks text into tokens based on word boundaries and also removes punctuation
○ The whitespace tokenizer breaks text into tokens at whitespace characters
● Token filters post-process the tokens from the tokenizer
○ Tokens can be added, removed, or modified
○ For example, the ASCII folding filter converts Unicode characters to the closest ASCII equivalent
○ The stemming token filter applies stemming rules to convert words to their root form
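The three analysis stages can be imitated in plain Python to show how text flows from character filter to tokenizer to token filters. This is a rough approximation using only the standard library: the regular expressions and the function names are assumptions for the sketch, not the rules the built-in components actually apply.

```python
import html
import re
import unicodedata

def html_strip(text):
    """Character filter: drop tags, then decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def tokenize(text):
    """Tokenizer: keep runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def ascii_fold(tokens):
    """Token filter: strip accents by removing combining marks."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded

def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filters."""
    return ascii_fold(lowercase(tokenize(html_strip(text))))

print(analyze("<p>Caf&eacute; M&uuml;nchen, re-opened!</p>"))
# ['cafe', 'munchen', 're', 'opened']
```

The resulting terms are what would be written to the inverted index, so a search for "cafe" can match text that originally contained "Café".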
28. Document Indexing: Field Data Types
● Text Type
○ Primarily used to index human-generated text such as tweets, social media posts, book contents, and product descriptions
○ Text fields are particularly useful for performing phrase queries, fuzzy queries, etc.
● Keyword Type
○ Typically used for indexing structured content such as names, IDs, ISBNs, categories, etc.
○ Keyword fields are particularly useful for sorting, aggregations, and running scripts, as they are stored in a columnar format
29. Document Indexing: More Field Data Types
● Numeric Type
○ Used for numeric data such as integers, unsigned integers, and floats
○ When choosing a numeric type, the smallest type that fits the input range should be chosen to conserve storage space
○ Numeric fields are stored as BKD trees
● Geo Point Type
○ Geo point is used to represent latitude and longitude data
○ With geo points, queries that rely on location or distance can be performed
○ BKD trees are also used to store geo points
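The field types above are assigned in an index mapping, and the advice to pick the smallest numeric type can be turned into a small helper. In this sketch the field names and the function `smallest_integer_type` are hypothetical; `text`, `keyword`, `byte`, `short`, `integer`, `long`, `float`, and `geo_point` are existing field types.

```python
# Signed integer field types and their value ranges.
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]

def smallest_integer_type(lowest, highest):
    """Return the narrowest integer field type that covers [lowest, highest]."""
    for name, type_min, type_max in INTEGER_TYPES:
        if type_min <= lowest and highest <= type_max:
            return name
    raise ValueError("range does not fit in a signed 64-bit integer")

# Hypothetical product index mapping.
mapping_body = {
    "mappings": {
        "properties": {
            "description": {"type": "text"},
            "category": {"type": "keyword"},
            "stock": {"type": smallest_integer_type(0, 20000)},
            "price": {"type": "float"},
            "warehouse": {"type": "geo_point"},
        }
    }
}

print(mapping_body["mappings"]["properties"]["stock"])  # {'type': 'short'}
```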
31. Document Searching: Basics
● OpenSearch uses a distributed search algorithm to match documents against the input
● Search can be exact-match based, such as keyword searches, or relevancy based, such as text searches
● OpenSearch focuses more on search speed than accuracy. The level of required accuracy is typically configurable
● OpenSearch provides near real-time search: all documents are available for search except the most recently indexed ones that have not yet been refreshed. By default, an index is refreshed every second
32. Document Searching: Search Phases
● OpenSearch's distributed search algorithm uses a query phase and a fetch phase
● Query phase
○ The query is sent to one copy of every shard of the index. The copy can be a primary or a replica
○ Each shard runs the search locally and returns its results
○ Results contain only the document IDs, scores, and other relevant metadata, not the actual documents
● Fetch phase
○ Query results from all the shards are ordered to form the final set of results
○ A fetch is performed to retrieve the actual documents from the nodes
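The two phases can be simulated with in-memory data to show what travels between shards and the coordinating node. This is a simplified model: the shard contents, scores, and function names are invented for the sketch, and scoring itself is skipped by storing a precomputed score per document.

```python
import heapq

# Hypothetical shards: each maps document id -> (score, stored document).
shards = [
    {"a1": (2.1, {"title": "OpenSearch basics"}), "a2": (0.4, {"title": "Lucene notes"})},
    {"b1": (3.0, {"title": "Search at scale"}), "b2": (1.2, {"title": "Sharding"})},
]

def query_phase(shards, size):
    """Each shard returns only (score, shard number, doc id) for its local top hits."""
    hits = []
    for shard_no, shard in enumerate(shards):
        local = [(score, shard_no, doc_id) for doc_id, (score, _) in shard.items()]
        hits.extend(heapq.nlargest(size, local))
    return hits

def fetch_phase(shards, hits, size):
    """Order all hits globally, then fetch only the winning documents."""
    top = heapq.nlargest(size, hits)
    return [(doc_id, score, shards[shard_no][doc_id][1]) for score, shard_no, doc_id in top]

results = fetch_phase(shards, query_phase(shards, size=2), size=2)
for doc_id, score, source in results:
    print(doc_id, score, source["title"])
# b1 3.0 Search at scale
# a1 2.1 OpenSearch basics
```

Only identifiers and scores cross the network in the first phase, so full documents are transferred just for the hits that make the final result page.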
33. Document Searching: Scoring Algorithm
● Relevancy searching, such as searching for words against text fields, involves scoring to determine which documents are the closest match
● Document scoring is performed by the similarity module. The default similarity is BM25
Score = TF * IDF * Norm
TF: Term frequency. How frequently the given term appears in the field. The more times the term appears in the document, the more likely the document is to be relevant.
IDF: Inverse document frequency. Reflects how frequently the term appears across all documents in the index. The more commonly the term appears, the less it contributes to relevance.
Norm: Length normalization. For the same term frequency, a shorter field is more relevant than a longer field.
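To make the three factors concrete, here is a sketch of the textbook BM25 formula for a single term. It assumes the commonly cited defaults k1 = 1.2 and b = 0.75; the function name and parameters are chosen for this example, and the exact constants and factors used by a given Lucene version may differ.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document.

    tf: occurrences of the term in the field
    doc_len / avg_doc_len: field length and average field length
    n_docs / doc_freq: total documents and documents containing the term
    """
    # IDF: rarer terms (small doc_freq) get a larger weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average fields are penalized.
    norm = 1 - b + b * doc_len / avg_doc_len
    # TF component saturates as tf grows instead of increasing linearly.
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

base = bm25_term_score(tf=2, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
print(bm25_term_score(5, 100, 100, 1000, 50) > base)   # more occurrences score higher
print(bm25_term_score(2, 50, 100, 1000, 50) > base)    # shorter field scores higher
print(bm25_term_score(2, 100, 100, 1000, 5) > base)    # rarer term scores higher
```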
35. Aggregations: Basics
● OpenSearch is not only a search engine but also has built-in, advanced analytics capabilities
● Aggregations make it possible to filter and categorize documents, calculate metrics, and build aggregation pipelines by combining multiple aggregations
● OpenSearch supports many aggregation types, which can be classified as
○ Metrics Aggregations
○ Bucket Aggregations
○ Pipeline Aggregations
37. Aggregations: Bucket Aggregations
● Bucket aggregations categorize the matching set of documents into buckets based on a bucketing criterion
● The bucketing criterion could be based on unique values of a field (terms aggregation), date ranges (date histogram aggregation), etc.
● Bucket aggregations can be used to
○ paginate all buckets (composite aggregation)
○ provide faceting
○ act as inputs to metric aggregations
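A terms aggregation groups documents by the unique values of a field and counts them. The sketch below shows a request body for such an aggregation together with a pure-Python imitation of what it computes. The field name `category`, the aggregation name `by_category`, and the sample documents are assumptions for the example.

```python
from collections import Counter

# Request body: size 0 skips the hits and returns only the aggregation.
request_body = {
    "size": 0,
    "aggs": {"by_category": {"terms": {"field": "category"}}},
}

def terms_buckets(docs, field):
    """Imitate a terms aggregation: one bucket per unique value, largest first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common()]

# Hypothetical documents.
docs = [
    {"category": "books"},
    {"category": "books"},
    {"category": "games"},
    {"category": "music"},
    {"category": "books"},
]

buckets = terms_buckets(docs, "category")
print(buckets[0])  # {'key': 'books', 'doc_count': 3}
```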
38. Aggregations: Metric Aggregations
● Metrics aggregations calculate metrics on the values generated from the documents. The values could be specific fields in the documents being aggregated or generated dynamically through scripts
● They can be included as sub-aggregations of bucket aggregations and will then produce metrics per aggregated bucket
● Numeric metric aggregations produce numeric metrics such as max, min, sum, and average values
● Some metric aggregations produce non-numeric output. A good example is the top hits aggregation. Used as a sub-aggregation, it produces the top matching documents per bucket
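Nesting a metric aggregation under a bucket aggregation yields one metric per bucket. The sketch below pairs a request body (an `avg` aggregation inside a `terms` aggregation) with a small Python function that computes the same kind of result locally. The field names, aggregation names, and documents are made up for the example.

```python
from collections import defaultdict

# Average price per category: "avg" nested inside "terms".
request_body = {
    "size": 0,
    "aggs": {
        "by_category": {
            "terms": {"field": "category"},
            "aggs": {"avg_price": {"avg": {"field": "price"}}},
        }
    },
}

def avg_per_bucket(docs, bucket_field, metric_field):
    """Group documents by bucket_field and average metric_field in each group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[bucket_field]].append(doc[metric_field])
    return {key: sum(values) / len(values) for key, values in groups.items()}

# Hypothetical documents.
docs = [
    {"category": "books", "price": 10.0},
    {"category": "books", "price": 20.0},
    {"category": "games", "price": 60.0},
]

print(avg_per_bucket(docs, "category", "price"))  # {'books': 15.0, 'games': 60.0}
```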
39. Aggregations: Pipeline Aggregations
● Pipeline aggregations can be used to compute metrics, and they act on the output of other aggregations, making it possible to build a chain of aggregations
● Pipeline aggregations can be further categorized as parent and sibling pipeline aggregations
○ Parent pipeline aggregations compute new aggregations based on the output of the parent aggregation
○ Sibling pipeline aggregations compute new aggregations based on output from one or more sibling aggregations
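The difference between parent and sibling pipeline aggregations shows up in where they sit in the request. In the sketch below, `cumulative_sum` is placed inside the date histogram (parent pipeline) while `avg_bucket` sits next to it (sibling pipeline). The field names, aggregation names, and monthly totals are hypothetical, and the Python lines after the request only imitate the results on sample data.

```python
from itertools import accumulate

request_body = {
    "size": 0,
    "aggs": {
        "sales_per_month": {
            "date_histogram": {"field": "order_date", "calendar_interval": "month"},
            "aggs": {
                "sales": {"sum": {"field": "price"}},
                # Parent pipeline: runs inside the histogram, over its buckets.
                "running_total": {"cumulative_sum": {"buckets_path": "sales"}},
            },
        },
        # Sibling pipeline: sits beside the histogram and summarizes its buckets.
        "avg_monthly_sales": {"avg_bucket": {"buckets_path": "sales_per_month>sales"}},
    },
}

# Hypothetical output of the "sales" sum per monthly bucket.
monthly_sales = [("2022-01", 100.0), ("2022-02", 300.0), ("2022-03", 200.0)]

running_total = list(accumulate(total for _, total in monthly_sales))
avg_monthly_sales = sum(total for _, total in monthly_sales) / len(monthly_sales)

print(running_total)       # [100.0, 400.0, 600.0]
print(avg_monthly_sales)   # 200.0
```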