Indexing big data in the cloud


lucenerevolution

Presented by Scott Stults | OpenSource Connections. See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Amazon Web Services offers a quick and easy way to build a scalable search platform. That flexibility is especially useful when an initial data load requires hardware that is no longer needed for day-to-day searching and adding new documents. This presentation will cover one such approach, capable of enlisting hundreds of worker nodes to ingest data, tracking their progress, and relinquishing them back to the cloud when the job is done. The data set that will be discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2005 and 2010, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. The same basic approach was also used to make three sizes of PNG thumbnails of the patent grant TIFF images. In that case 150 worker nodes were used to generate 1.6 Tb of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer and tricks for using XSLT on very large XML documents.

Indexing Big Data in the Cloud

Me
Scott Stults
Co-Founder of OpenSource Connections
Solr / Lucene
Bash / Python / Java

Eric

Big Data

Big Data Wrangler

How?
Address a Real Project
Be Agile
Make Small Mistaeks Fast
Succeed BIG

USPTO Goals
Prototype Search UX
Prove Solr:
Scales
Integrates
Excels

Scale?

Our Approach
KISS
YAGNI
(This space intentionally left blank)

Minimal Flair

Record Everything!

Some Numbers
Doc Count: 1.1 million
Zip Files: 313
Docs per Zip File: 4,000
Zip File Size: 75M
File Size: 300M
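
As a rough cross-check of these figures, the totals can be worked out with shell arithmetic. This is only a back-of-the-envelope sketch: it assumes "75M" and "300M" are megabytes per zip file and per unzipped file respectively, which the slide does not state. The per-zip doc count is evidently rounded, since 313 x 4,000 comes out above the 1.1 million doc count.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope totals from the slide's figures.
zip_files=313
docs_per_zip=4000   # rounded figure from the slide
zip_size_mb=75      # assumption: "75M" means megabytes per zip file
file_size_mb=300    # assumption: "300M" is the unzipped size of one file

total_docs=$(( zip_files * docs_per_zip ))
compressed_mb=$(( zip_files * zip_size_mb ))
uncompressed_mb=$(( zip_files * file_size_mb ))

echo "approx docs:         $total_docs"
echo "approx compressed:   $compressed_mb MB"
echo "approx uncompressed: $uncompressed_mb MB"
```

Under those assumptions the whole load is on the order of tens of gigabytes compressed, which is small enough for a single Solr instance to search but large enough that transforming it on one machine takes a while.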

Testing
Start some servers
Process a batch
Check the clock
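
The "process a batch, check the clock" loop could be sketched roughly as follows. The names `process_batch` and `time_batch` are hypothetical stand-ins, not taken from the talk; the real batch step would unzip, transform and post documents to Solr.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real batch work.
process_batch() {
  sleep 1   # placeholder for real processing
  echo "processed $1"
}

# Run a command and append its exit status and wall-clock seconds to a log file.
time_batch() {
  local log="$1"
  shift
  local start end status
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) status=$status seconds=$(( end - start )) cmd=$*" >> "$log"
  return "$status"
}

# Demonstration: time one made-up batch and show the recorded line.
log=$(mktemp)
time_batch "$log" process_batch sample-batch.zip
cat "$log"
rm -f "$log"
```

Writing one line per batch keeps a record of every run, so timings from different node counts can be compared afterwards.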

start_nodes

start_nodes() {
  # Launch $MAX_NODES m1.large instances and save the command output to ~/run-output
  ec2-run-instances ami-1b814f72 \
    --block-device-mapping '/dev/sdb=snap-48adde35::true' \
    --block-device-mapping '/dev/sdi1=:10:false' \
    --block-device-mapping '/dev/sdi2=:10:false' \
    --block-device-mapping '/dev/sdi3=:20:false' \
    --instance-type m1.large \
    --key uspto-proto \
    --instance-count "$MAX_NODES" \
    --group default > ~/run-output
}
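
The start_nodes function saves the launch output to ~/run-output. One way that file might be used to hand the nodes back when the job is done is sketched below. It assumes the legacy EC2 API tools' tabular output, in which each launched instance appears on a line whose first field is INSTANCE and whose second field is the instance ID. The helper names `instance_ids` and `stop_nodes` and the sample data are hypothetical, not from the talk.

```shell
#!/usr/bin/env bash
# Print the instance IDs found in saved ec2-run-instances output.
# Assumption: instance lines look like "INSTANCE<TAB>i-xxxxxxxx<TAB>ami-...".
instance_ids() {
  awk '$1 == "INSTANCE" { print $2 }' "$1"
}

# Terminate every instance listed in the saved output (hypothetical helper).
stop_nodes() {
  local ids
  ids=$(instance_ids "$1")
  [ -n "$ids" ] || { echo "no instances found in $1" >&2; return 1; }
  # Word splitting is intended here: one argument per instance ID.
  ec2-terminate-instances $ids
}

# Demonstration with made-up sample output rather than a real launch.
sample=$(mktemp)
printf 'RESERVATION\tr-00000000\t000000000000\tdefault\n' > "$sample"
printf 'INSTANCE\ti-11111111\tami-1b814f72\t\t\tpending\n' >> "$sample"
printf 'INSTANCE\ti-22222222\tami-1b814f72\t\t\tpending\n' >> "$sample"
instance_ids "$sample"
rm -f "$sample"
```

The demonstration only parses sample text. `stop_nodes ~/run-output` would call the real `ec2-terminate-instances` tool and is left uninvoked here.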

Gut Check
How fast can we do this?
What can we do in parallel?

Scaling
Raise our instance limit
xargs -P GNU parallel
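
A minimal sketch of the `xargs -P` idea: read file names on stdin and run a fixed number of workers at once. The worker command and the file names are hypothetical placeholders; the real worker would unzip, transform and post to Solr.

```shell
#!/usr/bin/env bash
# Hypothetical per-file worker, kept as a string so each xargs child can run it.
worker='echo "indexed $1"'

# Read file names on stdin; run up to $1 workers at once, one file per worker.
run_parallel() {
  xargs -P "$1" -n 1 sh -c "$worker" _
}

# Demonstration with made-up file names and two concurrent workers.
printf '%s\n' batch-001.zip batch-002.zip batch-003.zip | run_parallel 2
```

Output order is not guaranteed because the workers run concurrently, and file names containing whitespace would need extra care. GNU parallel covers the same ground with its `-j` option for the number of concurrent jobs.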

Shortcomings
SSH?
Error recovery
One Solr

Alternatives
CloudFormation
Puppet / Chef
Multiple Cores / Shards
Hadoop

Success

Victory Lap

Instances / Time

Thank You
https://github.com/sstults/patent-indexing
@scottstults
#o19s
22
Indexing Big Data in the Cloud

More Related Content

Viewers also liked

价值规律

Ideeën logo malawi

Herald van Zijl

Change.org + Net2DC Moving the Needle - Online Advocacy Strategies 5-19-16 (1)

Change.org + Net2DC Moving the Needle - Online Advocacy Strategies 5-19-16 (1)

Change.org + Net2DC Moving the Needle - Online Advocacy Strategies 5-19-16 (1)

Creating profiles

Creating profiles

Creating profiles

Projeto social marcelo 14 tp

Projeto social marcelo 14 tp

Projeto social marcelo 14 tp

Its Not About You

Its Not About You

Its Not About You

Grandview Church of the Nazarene

Viewers also liked (6)

价值规律

Ideeën logo malawi

Change.org + Net2DC Moving the Needle - Online Advocacy Strategies 5-19-16 (1)

Change.org + Net2DC Moving the Needle - Online Advocacy Strategies 5-19-16 (1)

Change.org + Net2DC Moving the Needle - Online Advocacy Strategies 5-19-16 (1)

Creating profiles

Creating profiles

Creating profiles

Projeto social marcelo 14 tp

Projeto social marcelo 14 tp

Projeto social marcelo 14 tp

Its Not About You

Its Not About You

Its Not About You

Similar to Indexing big data in the cloud

Presented by Scott Stults | OpenSource Connections - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012 Amazon Web Services offers a quick and easy way to build a scalable search platform, a flexibility is especially useful when an initial data load is required but the hardware is no longer needed for day-to-day searching and adding new documents. This presentation will cover one such approach capable of enlisting hundreds of worker nodes to ingest data, track their progress, and relinquish them back to the cloud when the job is done. The data set that will be discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2010 and 2005, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. Also, the same basic approach was used to make three sizes of PNG thumbnails of the patent grant TIFF images. In that case 150 worker nodes were used to generate 1.6 Tb of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer and tricks for using XSLT on very large XML documents.

Indexing Big Data on Amazon AWS

Indexing Big Data on Amazon AWS

Indexing Big Data on Amazon AWS

lucenerevolution

Amazon Web Services offers a quick and easy way to build a scalable search platform, a flexibility is especially useful when an initial data load is required but the hardware is no longer needed for day-to-day searching and adding new documents. This presentation will cover one such approach capable of enlisting hundreds of worker nodes to ingest data, track their progress, and relinquish them back to the cloud when the job is done. The data set that will be discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2010 and 2005, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. Also, the same basic approach was used to make three sizes of PNG thumbnails of the patent grant TIFF images. In that case 150 worker nodes were used to generate 1.6 Tb of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer and tricks for using XSLT on very large XML documents.

Indexing big data in the cloud

Indexing big data in the cloud

Indexing big data in the cloud

OpenSource Connections

Getting Started with Splunk Breakout Session

Getting Started with Splunk Breakout Session

Getting Started with Splunk Breakout Session

Skymind Open Power Summit ISV Round Table

Skymind Open Power Summit ISV Round Table

Skymind Open Power Summit ISV Round Table

Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...

Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...

Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...

Lucidworks (Archived)

Alluxio Global Online Meetup Apr 23, 2020 For more Alluxio events: https://www.alluxio.io/events/ Speakers: Jiao (Jennie) Wang, Intel Tsai Louie, Intel Bin Fan, Alluxio Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced without network being I/O bottlenecked. Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications. This talk, we will go over: - What is Analytics Zoo and how it works - How to run Analytics Zoo with Alluxio in deep learning applications - Initial performance benchmark results using the Analytics Zoo + Alluxio stack

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio

Sem tech 2011 v8

Sem tech 2011 v8

Sem tech 2011 v8

Data infrastructure architecture for medium size organization: tips for colle...

Data infrastructure architecture for medium size organization: tips for colle...

Data infrastructure architecture for medium size organization: tips for colle...

DataWorks Summit/Hadoop Summit

Introduction to PySpark

Introduction to PySpark

Introduction to PySpark

Tech talk on what Azure Databricks is, why you should learn it and how to get started. We'll use PySpark and talk about some real live examples from the trenches, including the pitfalls of leaving your clusters running accidentally and receiving a huge bill ;) After this you will hopefully switch to Spark-as-a-service and get rid of your HDInsight/Hadoop clusters. This is part 1 of an 8 part Data Science for Dummies series: Databricks for dummies Titanic survival prediction with Databricks + Python + Spark ML Titanic with Azure Machine Learning Studio Titanic with Databricks + Azure Machine Learning Service Titanic with Databricks + MLS + AutoML Titanic with Databricks + MLFlow Titanic with DataRobot Deployment, DevOps/MLops and Operationalization

Databricks for Dummies

Databricks for Dummies

Databricks for Dummies

SQLite and object-relational mapping in Java

SQLite and object-relational mapping in Java

SQLite and object-relational mapping in Java

Customer Presentation - KCP&L

Customer Presentation - KCP&L

Customer Presentation - KCP&L

Cinchcast (aka BlogTalkRadio) is a startup in New York City. Using only a phone, you can broadcast your message globally to millions of listeners. Thousands of broadcasts are happening every day on topics ranging from technology to battling cancer. In this talk, we will discuss how we accomplished this, the technology behind it, and the challenges ahead. We will talk about what it's like building a startup in .NET and the techniques we have used to scale, such as HTML and donut caching, lazy loading of data, elastic search, as well as marrying telephony to the web stack.

You Too Can Be a Radio Host Or How We Scaled a .NET Startup And Had Fun Doing It

You Too Can Be a Radio Host Or How We Scaled a .NET Startup And Had Fun Doing It

You Too Can Be a Radio Host Or How We Scaled a .NET Startup And Had Fun Doing It

Aleksandr Yampolskiy

Datastax / Cassandra Modeling Strategies

Datastax / Cassandra Modeling Strategies

Datastax / Cassandra Modeling Strategies

Anant Corporation

STIX Patterning: Viva la revolución!

STIX Patterning: Viva la revolución!

STIX Patterning: Viva la revolución!

UnConference for Georgia Southern Computer Science March 31, 2015

UnConference for Georgia Southern Computer Science March 31, 2015

UnConference for Georgia Southern Computer Science March 31, 2015

Christopher Curtin

Getting Started with Splunk Breakout Session

Getting Started with Splunk Breakout Session

Getting Started with Splunk Breakout Session

IBM Strategy for Spark

IBM Strategy for Spark

IBM Strategy for Spark

Big Data: an introduction

Big Data: an introduction

Big Data: an introduction

Bart Vandewoestyne

Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio

Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio

Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio

Similar to Indexing big data in the cloud (20)

Indexing Big Data on Amazon AWS

Indexing Big Data on Amazon AWS

Indexing Big Data on Amazon AWS

Indexing big data in the cloud

Indexing big data in the cloud

Indexing big data in the cloud

Getting Started with Splunk Breakout Session

Getting Started with Splunk Breakout Session

Getting Started with Splunk Breakout Session

Skymind Open Power Summit ISV Round Table

Skymind Open Power Summit ISV Round Table

Skymind Open Power Summit ISV Round Table

Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...

Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...

Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio

Sem tech 2011 v8

Sem tech 2011 v8

Sem tech 2011 v8

Data infrastructure architecture for medium size organization: tips for colle...

Data infrastructure architecture for medium size organization: tips for colle...

Data infrastructure architecture for medium size organization: tips for colle...

Introduction to PySpark

Introduction to PySpark

Introduction to PySpark

Databricks for Dummies

Databricks for Dummies

Databricks for Dummies

SQLite and object-relational mapping in Java

SQLite and object-relational mapping in Java

SQLite and object-relational mapping in Java

Customer Presentation - KCP&L

Customer Presentation - KCP&L

Customer Presentation - KCP&L

You Too Can Be a Radio Host Or How We Scaled a .NET Startup And Had Fun Doing It

You Too Can Be a Radio Host Or How We Scaled a .NET Startup And Had Fun Doing It

You Too Can Be a Radio Host Or How We Scaled a .NET Startup And Had Fun Doing It

Datastax / Cassandra Modeling Strategies

Datastax / Cassandra Modeling Strategies

Datastax / Cassandra Modeling Strategies

STIX Patterning: Viva la revolución!

STIX Patterning: Viva la revolución!

STIX Patterning: Viva la revolución!

UnConference for Georgia Southern Computer Science March 31, 2015

UnConference for Georgia Southern Computer Science March 31, 2015

UnConference for Georgia Southern Computer Science March 31, 2015

Getting Started with Splunk Breakout Session

Getting Started with Splunk Breakout Session

Getting Started with Splunk Breakout Session

IBM Strategy for Spark

IBM Strategy for Spark

IBM Strategy for Spark

Big Data: an introduction

Big Data: an introduction

Big Data: an introduction

Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio

Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio

Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio

More from lucenerevolution

Presented by Isabel Drost-Fromm, Software Developer, Apache Software Foundation/Nokia Gate 5 GmbH at Lucene/Solr Revolution 2013 Dublin Text classification automates the task of filing documents into pre-defined categories based on a set of example documents. The first step in automating classification is to transform the documents to feature vectors. Though this step is highly domain specific Apache Mahout provides you with a lot of easy to use tooling to help you get started, most of which relies heavily on Apache Lucene for analysis, tokenisation and filtering. This session shows how to use facetting to quickly get an understanding of the fields in your document. It will walk you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use including a few anecdotes on drafting domain specific features. Configure

Text Classification Powered by Apache Mahout and Lucene

Text Classification Powered by Apache Mahout and Lucene

Text Classification Powered by Apache Mahout and Lucene

lucenerevolution

Presented by Markus Klose, Search + Big Data Consultant SHI Elektronische Medien GmbH at Lucene/Solr Revolution 2013 Dublin Kibana4Solr is search-driven, scalable, browser based and extremely user friendly (also for non-technical users). Logs are everywhere. Any device, system or human can potentially produce a huge amount of information saved in logs. The amount of available logs and their semi-structured nature make a meaningful processing in real-time quite a difficult task. Thus, valuable business insights stored in logs might be not found. Kibana4Solr is a search-driven approach to handle that challenge. It offers user-friendly and browser-based dashboard which can be easily customized to particular needs. In the session the Kibana4Solr will be introduced. Some light will be shed on the architectural features of Kibana4Solr. Some ideas will be given in terms of possible business uses cases. And finally a live demo of Kibana4Solr will be shown. Configure

State of the Art Logging. Kibana4Solr is Here!

State of the Art Logging. Kibana4Solr is Here!

State of the Art Logging. Kibana4Solr is Here!

lucenerevolution

Search at Twitter

Search at Twitter

Search at Twitter

lucenerevolution

Presented by Daniel Beach, Search Application Developer, OpenSource Connections Solr is a powerful search engine, but creating a custom user interface can be daunting. In this fast paced session I will present an overview of how to implement a client-side search application using Solr. Using open-source frameworks like SpyGlass (to be released in September) can be a powerful way to jumpstart your development by giving you out-of-the box results views with support for faceting, autocomplete, and detail views. During this talk I will also demonstrate how we have built and deployed lightweight applications that are able to be performant under large user loads, with minimal server resources.

Building Client-side Search Applications with Solr

Building Client-side Search Applications with Solr

Building Client-side Search Applications with Solr

lucenerevolution

Presented by Timothy Potter, Founder, Text Centrix Storm is a real-time distributed computation system used to process massive streams of data. Many organizations are turning to technologies like Storm to complement batch-oriented big data technologies, such as Hadoop, to deliver time-sensitive analytics at scale. This talk introduces on an emerging architectural pattern of integrating Solr and Storm to process big data in real time. There are a number of natural integration points between Solr and Storm, such as populating a Solr index or supplying data to Storm using Solr’s real-time get support. In this session, Timothy will cover the basic concepts of Storm, such as spouts and bolts. He’ll then provide examples of how to integrate Solr into Storm to perform large-scale indexing in near real-time. In addition, we'll see how to embed Solr in a Storm bolt to match incoming tuples against pre-configured queries, commonly known as percolator. Attendees will come away from this presentation with a good introduction to stream processing technologies and several real-world use cases of how to integrate Solr with Storm.

Integrate Solr with real-time stream processing applications

Integrate Solr with real-time stream processing applications

Integrate Solr with real-time stream processing applications

lucenerevolution

Configure your Solr cluster to handle hundreds of millions of documents without even noticing, handle queries in milliseconds, use Near Real Time indexing and searching with document versioning. Scale your cluster both horizontally and vertically by using shards and replicas. In this session you'll learn how to make your indexing process blazing fast and make your queries efficient even with large amounts of data in your collections. You'll also see how to optimize your queries to leverage caches as much as your deployment allows and how to observe your cluster with Solr administration panel, JMX, and third party tools. Finally, learn how to make changes to already deployed collections —split their shards and alter their schema by using Solr API.

Scaling Solr with SolrCloud

Scaling Solr with SolrCloud

Scaling Solr with SolrCloud

lucenerevolution

Presented by Rafal Kuć, Consultant and Software engineer, , Sematext Group, Inc. Even though Solr can run without causing any troubles for long periods of time it is very important to monitor and understand what is happening in your cluster. In this session you will learn how to use various tools to monitor how Solr is behaving at a high level, but also on Lucene, JVM, and operating system level. You'll see how to react to what you see and how to make changes to configuration, index structure and shards layout using Solr API. We will also discuss different performance metrics to which you ought to pay extra attention. Finally, you'll learn what to do when things go awry - we will share a few examples of troubleshooting and then dissect what was wrong and what had to be done to make things work again.

Administering and Monitoring SolrCloud Clusters

Administering and Monitoring SolrCloud Clusters

Administering and Monitoring SolrCloud Clusters

lucenerevolution

In a recent project with the United States Patent and Trademark Office, Opensource Connections was asked to prototype the next generation of patent search - using Solr and Lucene. An important aspect of this project was the implementation of BRS, a specialized search syntax used by patent examiners during the examination process. In this fast paced session we will relate our experiences and describe how we used a combination of Parboiled (a Parser Expression Grammar [PEG] parser), Lucene Queries and SpanQueries, and an extension of Solr's QParserPlugin to build BRS search functionality in Solr. First we will characterize the patent search problem and then define the BRS syntax itself. We will then introduce the Parboiled parser and discuss various considerations that one must make when designing a syntax parser. Following this we will describe the methodology used to implement the search functionality in Lucene/Solr. Finally, we will include an overview our syntactic and semantic testing strategies. The audience will leave this session with an understanding of how Solr, Lucene, and Parboiled may be used to implement their own custom search parser.

Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled

Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled

Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled

lucenerevolution

Many of us tend to hate or simply ignore logs, and rightfully so: they’re typically hard to find, difficult to handle, and are cryptic to the human eye. But can we make logs more valuable and more usable if we index them in Solr, so we can search and run real-time statistics on them? Indeed we can, and in this session you’ll learn how to make that happen. In the first part of the session we’ll explain why centralized logging is important, what valuable information one can extract from logs, and we’ll introduce the leading tools from the logging ecosystems everyone should be aware of - from syslog and log4j to LogStash and Flume. In the second part we’ll teach you how to use these tools in tandem with Solr. We’ll show how to use Solr in a SolrCloud setup to index large volumes of logs continuously and efficiently. Then, we'll look at how to scale the Solr cluster as your data volume grows. Finally, we'll see how you can parse your unstructured logs and convert them to nicely structured Solr documents suitable for analytical queries.

Using Solr to Search and Analyze Logs

Using Solr to Search and Analyze Logs

Using Solr to Search and Analyze Logs

lucenerevolution

Enhancing relevancy through personalization & semantic search

Enhancing relevancy through personalization & semantic search

Enhancing relevancy through personalization & semantic search

lucenerevolution

Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr's full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.

Real-time Inverted Search in the Cloud Using Lucene and Storm

Real-time Inverted Search in the Cloud Using Lucene and Storm

Real-time Inverted Search in the Cloud Using Lucene and Storm

lucenerevolution

Like many Web-Applications in the past, the Solr Admin UI up until 4.0 was entirely server based. It used separate code on the server to generate their Dashboards, Overviews and Statistics. All that code had to be maintained and still ... you weren't really able to use that kind of data for the things you needed it for. It was wrapped into HTML, most of the time difficult to extract and changed the structure from time to time w/o announcement. After a short look back, we're going to look into the current state of the Solr Admin UI - a client-side application, running completely in your browser. We'll see how it works, where it gets its data from and how you can get the very same data and wire that into your own custom applications, dashboards and/oder monitoring systems.

Solr's Admin UI - Where does the data come from?

Solr's Admin UI - Where does the data come from?

Solr's Admin UI - Where does the data come from?

lucenerevolution

Schemaless Solr and the Solr Schema REST API

Schemaless Solr and the Solr Schema REST API

Schemaless Solr and the Solr Schema REST API

lucenerevolution

Presented by Renaud Delbru, Co-Founder, SindiceTech In this presentation, we will discuss how Lucene and Solr can be used for very efficient search of tree-shaped schemaless document, e.g. JSON or XML, and can be then made to address both graph and relational data search. We will discuss the capabilities of SIREn, a Lucene/Solr plugin we have developed to deal with huge collections of tree-shaped schemaless documents, and how SIREn is built using Lucene extensibility capabilities (Analysis, Codec, Flexible Query Parser). We will compare it with Lucene's BlockJoin Query API in nested schemaless data intensive scenarios. We will then go through use cases that show how relational or graph data can be turned into JSON documents using Hadoop and Pig, and how this can be used in conjunction with SIREn to create relational faceting systems with unprecedented performance. Take-away lessons from this session will be awareness about using Lucene/Solr and Hadoop for relational and graph data search, as well as the awareness that it is now possible to have relational faceted browsers with sub-second response time on commodity hardware.

High Performance JSON Search and Relational Faceted Browsing with Lucene

High Performance JSON Search and Relational Faceted Browsing with Lucene

High Performance JSON Search and Relational Faceted Browsing with Lucene

lucenerevolution

In this session we will show how to build a text classifier using the Apache Lucene/Solr with libSVM libraries. We classify our corpus of job offers into a number of predefined categories. Each indexed document (a job offer) then belongs to zero, one or more categories. Known machine learning techniques for text classification include naïve bayes model, logistic regression, neural network, support vector machine (SVM), etc. We use Lucene/Solr to construct the features vector. Then we use the libsvm library known as the reference implementation of the SVM model to classify the document. We construct as many one-vs-all svm classifiers as there are classes in our setting, then using the Hadoop MapReduce Framework we reconcile the result of our classifiers. The end result is a scalable multi-class classifier. Finally we outline how the classifier is used to enrich basic solr keyword search.

Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

lucenerevolution

Faceted search is a powerful technique to let users easily navigate the search results. It can also be used to develop rich user interfaces, which give an analyst quick insights about the documents space. In this session I will introduce the Facets module, how to use it, under-the-hood details as well as optimizations and best practices. I will also describe advanced faceted search capabilities with Lucene Facets.

Faceted Search with Lucene

Faceted Search with Lucene

Faceted Search with Lucene

lucenerevolution

Presented by Shai Erera, Researcher, IBM Lucene's arsenal has recently expanded to include two new modules: Index Sorting and Replication. Index sorting lets you keep an index consistently sorted based on some criteria (e.g. modification date). This allows for efficient search early-termination as well as achieve better index compression. Index replication lets you replicate a search index to achieve high-availability, fault tolerance as well as take hot index backups. In this talk we will introduce these modules, discuss implementation and design details as well as best practices.

Recent Additions to Lucene Arsenal

Recent Additions to Lucene Arsenal

Recent Additions to Lucene Arsenal

lucenerevolution

As part of their work with large media monitoring companies, Flax has developed a technique for applying tens of thousands of stored Lucene queries to a document in under a second. We'll talk about how we built intelligent filters to reduce the number of actual queries applied and how we extended Lucene to extract the exact hit positions of matches, the challenges of implementation, and how it can be used, including applications that monitor hundreds of thousands of news stories every day.

Turning search upside down

Turning search upside down

Turning search upside down

lucenerevolution

Presented by Xavier Sanchez Loro, Ph.D, Trovit Search SL This session aims to explain the implementation and use case for spellchecking in Trovit search engine. Trovit is a classified ads search engine supporting several different sites, one for each on country and vertical. Our search engine supports multiple indexes in multiple languages, each with several millions of indexed ads. Those indexes are segmented in several different sites depending on the type of ads (homes, cars, rentals, products, jobs and deals). We have developed a multi-language spellchecking system using solr and lucene in order to help our users to better find the desired ads and avoid the dreaded 0 results as much as possible. As such our goal is not pure orthographic correction, but also suggestion of correct searches for a certain site.

Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...

Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...

Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...

lucenerevolution

Shrinking the haystack wes caldwell - final

Shrinking the haystack wes caldwell - final

Shrinking the haystack wes caldwell - final

lucenerevolution

Indexing big data in the cloud

1. Indexing Big Data in the Cloud

2. Me: Scott Stults, Co-Founder of OpenSource Connections; Solr / Lucene; Bash / Python / Java

3. Eric

4. Big Data

5. Big Data Wrangler

6. How? Address a Real Project; Be Agile; Make Small Mistaeks Fast; Succeed BIG

7. USPTO Goals: Prototype Search UX; Prove Solr Scales, Integrates, Excels

8. Scale?

9. Our Approach: KISS; YAGNI (This space intentionally left blank)

10. Minimal Flair

11. Record Everything!

12. Some Numbers: Doc Count 1.1 Million; Zip Files 313; Docs per Zip File 4,000; Zip File Size 75M; File Size 300M

13. Testing: Start some servers; Process a batch; Check the clock
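
The "process a batch, check the clock" step might look something like the sketch below. It is not taken from the talk: `time_batch` is a hypothetical helper name, and `sleep 1` only stands in for the real work of processing one zip file.

```shell
# Hypothetical helper: run the given command and print the elapsed whole seconds.
time_batch() {
  start=$(date +%s)
  "$@" >/dev/null
  end=$(date +%s)
  echo $((end - start))
}

# Assumption: `sleep 1` stands in for processing one batch.
elapsed=$(time_batch sleep 1)
echo "batch took ${elapsed}s"
```

Multiplying one batch's time by the number of zip files gives a rough serial estimate to compare against the parallel runs.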

14. start_nodes

    start_nodes() {
      ec2-run-instances ami-1b814f72 \
        --block-device-mapping '/dev/sdb=snap-48adde35::true' \
        --block-device-mapping '/dev/sdi1=:10:false' \
        --block-device-mapping '/dev/sdi2=:10:false' \
        --block-device-mapping '/dev/sdi3=:20:false' \
        --instance-type m1.large \
        --key uspto-proto \
        --instance-count $MAX_NODES \
        --group default > ~/run-output
    }

15. Gut Check: How fast can we do this? What can we do in parallel?

16. Scaling: Raise our instance limit; xargs -P; GNU parallel
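
As a rough illustration of the `xargs -P` item, the sketch below fans a list of file names out to several concurrent workers. It is an assumption-laden example, not the talk's actual script: `WORKER`, `run_parallel`, and the `batch-00N.zip` names are hypothetical, and the `echo` stands in for the real per-file work (unzip, transform, post to Solr).

```shell
# Hypothetical per-file command; $1 is the file name handed over by xargs.
WORKER='echo "processed $1"'

# Read file names from stdin and run up to $1 workers at once, one file per worker.
run_parallel() {
  xargs -P "$1" -n 1 sh -c "$WORKER" _
}

# Hypothetical batch of four zip files, processed four at a time.
printf '%s\n' batch-001.zip batch-002.zip batch-003.zip batch-004.zip | run_parallel 4
```

Output order is not guaranteed because the workers run concurrently, so anything that depends on ordering should sort or tag the results.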

17. Shortcomings: SSH?; Error recovery; One Solr

18. Alternatives: CloudFormation; Puppet / Chef; Multiple Cores / Shards; Hadoop

19. Success

20. Victory Lap

21. Instances / Time

22. Thank You: https://github.com/sstults/patent-indexing, @scottstults, #o19s