Talk given for the #phpbenelux user group, March 27th in Gent (BE), with the goal of convincing developers who are used to building php/mysql apps to broaden their horizons when adding search to their site. Be sure to also have a look at the notes for the slides; they explain some of the screenshots, etc.
An accompanying blog post about this subject can be found at http://www.jurriaanpersyn.com/archives/2013/11/18/introduction-to-elasticsearch/
This talk is about adding search to your own website: implementing a search engine for your own content. First the bit about how it’s done in MySQL, and what problems that brings with it; then about what a search engine should do for you; then about how Elastic Search helps; and code of course, lots of code.
I’m leading the development team at Engagor, a startup in the city centre of Gent, Belgium. We’re 2 years old. We build a social media monitoring & management product. Before that I worked for Massive Media for 5 years. As a lead developer I worked on the Netlog & Gatcha products.
Search can be a “simple” text search. Here I’m searching Tumblr for funny gifs, because that’s what Tumblr is for.
But search can go deeper and more into detail too. Here I’m using: AND, OR, NOT; nesting; restrictions on fields.
Or very difficult … Searching in a mixed set of data: profiles, photos, friend connections. Searching in a graph …
My first thought when I’d have to add search to a php/mysql site … It sort of works …
Problems arise when you have lots of data … To speed things up you add indices to your MySQL tables.
And the library analogy for a MySQL index is this … An index card box.
MySQL has an index type especially for full text search. Natural Language: case insensitive, accent insensitive. Query Expansion: a search for “database” also returns results that often contain words like “mysql”, “oracle”, … A second search with the extended query string happens, to find related documents too. Boolean Mode: adds operators to the Natural Language type.
Anyone using a similar system to this? Implemented yourself, or from a CMS? phpBB has/had a table like this.
Example of FULLTEXT MySQL search query.
Example of FULLTEXT MySQL search query in boolean mode. A bit more powerful.
Now, we add restrictions on certain fields. Now you need combined indices to keep this fast.
And even more restrictions. Indices on all involved fields? In all combinations? In all orders? Lots of indices make WRITE operations on your tables slower.
MySQL just isn’t built for complex search … So let’s look at what a system built for search needs …
So it’s old. But still active. Used a lot.
It is however a Java library. It’s not a fully managed service.
If you have lots of data, you need a search engine to be scalable & highly available.
And that’s where ElasticSearch comes in.
Download, unzip, start.
This time we install it in the right place, and wrapped in a service.
Now it runs on your localhost. As an HTTP service, so open your browser and surf to your Elastic Search server.
HTTP access, it’s brilliant! Do you want to secure it? Add firewalls … Do you want to cache it? Add Varnish …
Example of adding something to ElasticSearch from your command shell via an HTTP PUT request done by curl. You can do this right after installation. No need to create an index or configure anything, just add data right away.
Adding a second record (document).
Updating an existing record. (Mind the new version number.)
Want to see the record? Surf to it. The URL consists of index, type & id.
Want to add a new field (column)? Update your document with the new field added.
It’s right there.
So it’s schemaless. And actually we did ZERO CONFIGURATION. We didn’t have to create indices or tell Elastic Search what type of data we’ll be adding. Actually, you can configure a mapping/schema: to require certain fields to be of a certain type; to keep text fields from being analyzed (text analysis). Basically: to speed things up …
What we’ve demoed so far is a NoSQL store. That’s cool. But not all.
Here we do a GET request (in the browser) that searches our newly created index for the word “smashing”. It returns the 2nd document.
Curl in PHP is simple. Simplest example of how to do a search to Elastic from PHP.
I’ve mentioned a few Elastic Search specific terms. Here’s the full breakdown of the terminology and how the terms relate to MySQL concepts.
Back to the clustering features Elastic Search adds …
Here you have an Engagor-specific dashboard that shows the 12 servers in the Engagor cluster. You see: server12 is the master; every node knows each other; there are 11K shards, each with one replica. Health = green means every shard has a replica (on a different server). If one server goes down: no problem.
Now let’s look at how Engagor uses Elastic Search. First, what do we do? From a technical point of view: Engagor = huge database of social messages. Facebook, Twitter, Instagram, news sites … We save those that our clients are interested in to our application and offer: statistics about the tracked data, workflow tools, automation tools.
This is the time to show a slide I stole from a presentation from our CEO Folke. We started 2 years ago. I joined after 6 months. Now a team of 16 people, 7 of them technical profiles: web developers, data scientists, backend developers, frontend developers. Our customers include Mobistar, Telenet, Microsoft Europe, European Commission, Alpro and several agencies. They use Engagor for customer support (“call center” software for social media), marketing insights, crisis detection, …
Here you see an example of a search on our cluster for a certain Twitter user’s handle. You see it returns 260 social messages. Each message has data like the: id, service it’s from, content, date, author details.
Here we’re searching in a topic about Belgacom & Mobile Vikings. “All messages from users with at least 1000 followers, that are negative and from Belgium or in Dutch.” On 4 different fields, and nested … It just works.
This is the Query DSL we’re sending to Elastic for the previous search.
Here we are in a topic about Coca Cola, thus high volume. About 50k messages per day, over 28 days. That’s 1.4M messages we’re searching in. This is a graph of messages per day.
This is the inbox. Showing the last 10 messages in that topic. Performance: about half a second.
Same inbox, but now only showing messages with the word “thirsty”. Performance: again about half a second. (Only 1 sample, so this is not really a benchmark.)
Now, there’s another feature of search engines you might not immediately think about.
Think of it as an equivalent to a MySQL GROUP BY query.
The pie chart: facet on the sentiment field of a mention. Returns totals per value. Sentiment per day: facet on combined fields: sentiment + day(dateadd). Returns the data used for the second chart. You can also use the filter like in the inbox, to see these facets for a filtered set of data. E.g. sentiment per day for mentions coming from Belgium only.
For the Telenet Twitter profile: totals of messages per day, per type (retweets, mentions, own tweets). Every “segment” (color) is a facet with a custom filter/search query. Single ElasticSearch call to get all this information.
Percolation example: use it to route documents … E.g. you have a stream of data coming in and need to decide what to do with those documents based on queries. Everything matching x is for client A. Everything matching y is for client B.
There’s good competition going on. Looks like Elastic Search has made Solr alert again, since they’re also focusing on clustering features now.
Having a great search engine doesn’t mean you have a great search experience on your site … Users aren’t very good at it … But searching for things is not that easy …
Look at how natural language differs from search query language.
This happens quite often at the Engagor office. Not that we blame our users … It’s just difficult to get it right.
If you want to play with it, and are excited … But your boss is all #meh.
ElasticSearch is a mature product. Soundcloud is using it.
StumbleUpon is using it.
Foursquare switched to it … When checking in somewhere, they show: locations close to you, locations you previously checked in, locations that are popular. That’s a pretty nifty search query right there.
There’s also a real company behind Elastic Search, backing it with: tutorials, training, support, SLAs, consulting.
Recently redesigned site. Documentation is a bit frightening at first, since it’s hard to know what to search for, but I hope this presentation solves that. IRC channel is very active; quick answers.
Already gave an overview of what it’s like to work with the technologies at Engagor. Go to http://startupshizzle.tumblr.com to find out what it’s like to work with the people of our team ;).
So, that’s it.
An Introduction to Elastic Search.
Jurriaan Persyn, @oemebamo – CTO Engagor
SELECT * FROM myauwesomewebsite WHERE `text` LIKE '%shizzle%'

TEXT SEARCH WITH PHP/MYSQL
• FULLTEXT index
  • Only for CHAR, VARCHAR & TEXT columns
  • For MyISAM & InnoDB tables
  • Configurable Stop Words
  • Types:
    • Natural Language
    • Natural Language with Query Expansion
    • Boolean Full Text

MYSQL FULLTEXT BOOLEAN MODE
Operators:
• + AND
• - NOT
• OR implied
• () Nesting
• * Wildcard
• " Literal matches

TEXT SEARCH WITH PHP/MYSQL (CONT'D)
• Typical columns for search table:
  • Type
  • Id
  • Text
  • Priority
• Process:
  • Blog posts, comments, …
  • Save (filtered) duplicate of text in search table.
  • When searching …
  • Search table and translate to original data via type/id
This is how most php/mysql sites implement their search, right?
SELECT * FROM mysearchtable WHERE MATCH(text) AGAINST ('shizzle')

SELECT * FROM mysearchtable WHERE MATCH(text) AGAINST ('+shizzle -"manizzle"' IN BOOLEAN MODE)

SELECT * FROM jobs
WHERE role = 'DEVELOPER'
AND MATCH(job_description) AGAINST ('node.js')

SELECT * FROM jobs j
JOIN jobs_benefits jb ON j.id = jb.job_id
WHERE j.role = 'DEVELOPER'
AND MATCH(j.job_description) AGAINST ('node.js -asp' IN BOOLEAN MODE)
AND jb.free_espresso = TRUE
WHAT IS A SEARCH ENGINE?
• Efficient indexing of data
  • On all fields / combination of fields
• Analyzing data
  • Text Search
    • Tokenizing
    • Stemming
    • Filtering
  • Understanding locations
  • Date parsing
• Relevance scoring
TOKENIZING
• Finding word boundaries
  • Not just explode(' ', $text);
  • Chinese has no spaces. (Not every single character is a word.)
• Understand patterns:
  • URLs
  • Emails
  • #hashtags
  • Twitter @mentions
  • Currencies (EUR, €, …)
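To make the "not just explode" point concrete, here is a minimal sketch in Python of a pattern-aware tokenizer. The regular expressions are simplified assumptions for illustration only; they are not what Lucene or Elastic Search use internally, and they don't attempt languages without spaces or currency symbols.

```python
import re

# Order matters: the specific patterns come before the generic word pattern.
TOKEN_PATTERN = re.compile(
    r"""
    https?://[^\s]+                   # URLs
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email addresses
    | [#@]\w+                         # hashtags and Twitter mentions
    | \w+(?:'\w+)?                    # plain words, keeping a simple apostrophe
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return lowercased tokens, keeping URLs, emails, #tags and @mentions whole."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(tokenize("Mail me@example.com or see http://example.com/a #php @oemebamo"))
```

Splitting the same sentence on spaces would leave punctuation glued to words, and a naive "strip everything that isn't a letter" approach would tear the URL and the email address apart.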
STEMMING
• “Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form.”
  • Conjugations
  • Plurals
• Example:
  • Fishing, Fished, Fish, Fisher > Fish
  • Better > Good
• Several ways to find the stem:
  • Lookup tables
  • Suffix-stripping
  • Lemmatization
  • …
• Different stemmers for every language.
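A toy sketch of two of those strategies combined, a lookup table for irregular forms plus suffix-stripping. The suffix list and the lookup table are made up for this example and are English-only; real stemmers (Porter, Snowball, …) have far more rules.

```python
# Irregular forms are handled by a lookup table (the "Better > Good" case).
IRREGULAR = {"better": "good", "best": "good"}

# Suffixes are tried in this order; the list is deliberately tiny.
SUFFIXES = ("ing", "ed", "er", "es", "s")


def stem(word):
    """Reduce a word to a crude stem using the lookup table, then suffix-stripping."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for word in ("Fishing", "Fished", "Fish", "Fisher", "Better"):
        print(word, ">", stem(word))
```

Note how quickly a rule list like this goes wrong on other words, which is exactly why you want a search engine to ship language-specific stemmers for you.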
FILTERING
• Remove stop words
  • Different for every language
• HTML
  • If you're indexing web content, not every character is meaningful.
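Both filters can be sketched with nothing but the Python standard library. The stop word list below is a tiny illustrative assumption, and the HTML handling simply keeps all text nodes (so the contents of script or style tags would slip through).

```python
from html.parser import HTMLParser

# Illustrative English stop words; real lists are longer and differ per language.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return only the text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


def filter_tokens(markup):
    """Strip the markup, split on whitespace and drop stop words."""
    words = strip_html(markup).lower().split()
    return [word for word in words if word not in STOP_WORDS]


if __name__ == "__main__":
    print(filter_tokens("<p>The <b>art</b> of search</p>"))
```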
UNDERSTANDING LOCATIONS
• Geocoding of locations to longitude & latitude
• Search on location:
  • Bounding box searches
  • Distance searches
  • Searching nearby
  • Geo Polygons
  • Searching a country
(Note: MySQL also has geospatial indices.)
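The two simplest location searches boil down to a bit of arithmetic. A minimal sketch, assuming coordinates in decimal degrees and a spherical earth; the bounding box check ignores boxes that cross the 180° meridian, and the function names are made up for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius, good enough for "nearby" searches


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_bounding_box(lat, lon, top, left, bottom, right):
    """True if the point lies inside the box given by its top-left and bottom-right corners."""
    return bottom <= lat <= top and left <= lon <= right


def nearby(points, lat, lon, max_km):
    """Distance search: keep the (name, lat, lon) points within max_km of the centre."""
    return [name for name, p_lat, p_lon in points if haversine_km(lat, lon, p_lat, p_lon) <= max_km]


if __name__ == "__main__":
    cities = [("Gent", 51.05, 3.72), ("Brussels", 50.85, 4.35), ("Paris", 48.86, 2.35)]
    print(nearby(cities, 51.05, 3.72, 100))
```

A search engine does the same thing, only backed by an index so it doesn't have to compute the distance to every document.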
RELEVANCE SCORING
• From the matched documents, which ones do you show first?
• Several strategies:
  • How many matches in document?
  • How many matches in document as percentage of length?
  • Custom scoring algorithms
    • At index time
    • At search time
  • … A combination
Think of Google PageRank.
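The "matches as percentage of length" strategy fits in a few lines. This is only a sketch of that one strategy under naive whitespace tokenizing; the scoring Lucene actually applies is more involved.

```python
def score(document, term):
    """Number of occurrences of the term divided by the document length in words."""
    words = document.lower().split()
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


def rank(documents, term):
    """Return only the matching documents, best score first."""
    scored = [(score(document, term), document) for document in documents]
    matches = [pair for pair in scored if pair[0] > 0]
    return [document for _, document in sorted(matches, key=lambda pair: pair[0], reverse=True)]


if __name__ == "__main__":
    docs = [
        "a long text that mentions search only once",
        "search search engine",
        "nothing relevant here",
    ]
    for document in rank(docs, "search"):
        print(document)
```

Dividing by the length is what keeps a long document with a single passing mention from outranking a short one that is all about the term.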
ELASTICSEARCH
• “You know, for Search”
• Also Free & Open Source
• Built on top of Lucene
• Created by Shay Banon @kimchy
• Versions
  • First public release, v0.4 in February 2010
    • A rewrite of earlier “Compass” project, now with scalability built-in from the very core
  • Now stable version at 0.20.6
  • Beta branch at 0.90 (working towards 1.0 release)
• In Java, so inherently cross-platform
WHAT DOES IT ADD TO LUCENE?
• RESTful Service
  • JSON API over HTTP
  • Want to use it from PHP?
    • CURL requests, as if you'd do requests to the Facebook Graph API.
• High Availability & Performance
  • Clustering
• Long Term Persistency
  • Write through to persistent storage system.
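Because it's plain JSON over HTTP, any language with an HTTP client will do. Here is a hedged sketch in Python rather than PHP: the index name "blog" and type "post" are hypothetical, the server is assumed to listen on localhost at the default port 9200, and the function that actually sends the request is defined but not called, since it needs a running server.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default HTTP port


def document_url(index, doc_type, doc_id):
    """A document's URL consists of index, type & id."""
    return "/".join([BASE_URL, quote(index), quote(doc_type), quote(str(doc_id))])


def search_url(index, text):
    """URL for a simple text search through the q query-string parameter."""
    return f"{BASE_URL}/{quote(index)}/_search?{urlencode({'q': text})}"


def search(index, text):
    """Send the search as a GET request and decode the JSON response."""
    with urlopen(search_url(index, text)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(document_url("blog", "post", 1))
    print(search_url("blog", "smashing"))
```

The two URLs printed at the end are the same ones you could paste into your browser.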
$ cd ~/Downloads
$ wget https://github.com/…/elasticsearch-0.20.5.tar.gz
$ tar -xzf elasticsearch-0.20.5.tar.gz
$ cd elasticsearch-0.20.5/
$ ./bin/elasticsearch

$ cd ~/Downloads
$ wget https://github.com/…/elasticsearch-0.20.5.tar.gz
$ tar -xzf elasticsearch-0.20.5.tar.gz
$ git clone https://github.com/elasticsearch/elasticsearch-servicewrapper.git elasticsearch-servicewrapper
$ sudo mv elasticsearch-0.20.5 /usr/local/share
$ cd elasticsearch-servicewrapper
$ sudo mv service /usr/local/share/elasticsearch-0.20.5/bin
$ cd /usr/local/share
$ sudo ln -s elasticsearch-0.20.5 elasticsearch
$ sudo chown -R root:wheel elasticsearch
$ cd /usr/local/share/elasticsearch
$ sudo bin/service/elasticsearch start

$ sudo bin/service/elasticsearch start
Starting ElasticSearch...
Waiting for ElasticSearch......
running: PID:83071
$
SCHEMALESS, DOCUMENT ORIENTED
• No need to configure schema upfront
• No need for slow ALTER TABLE-like operations
• You can define a mapping (schema) to customize the indexing process
  • Require fields to be of certain type
  • If you want text fields that should not be analyzed (stemming, …)

TERMINOLOGY
MySQL                     Elastic Search
Database                  Index
Table                     Type
Row                       Document
Column                    Field
Schema                    Mapping
Index                     Everything is indexed
SQL                       Query DSL
SELECT * FROM table …     GET http://…
UPDATE table SET …        PUT http://…
DISTRIBUTED & HIGHLY AVAILABLE
• Multiple servers (nodes) running in a cluster
  • Acting as single service
  • Nodes in cluster that store data or nodes that just help in speeding up search queries.
• Sharding
  • Indices are sharded (# shards is configurable)
  • Each shard can have zero or more replicas
  • Replicas on different servers (server pools) for failover
  • One server in the cluster goes down? No problem.
• Master
  • Automatic Master detection + failover
  • Responsible for distribution/balancing of shards
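The idea behind sharding and replicas can be sketched in a few lines: hash the document id to pick a shard, and keep each replica on a different server than its primary. Everything below is an illustrative assumption (crc32 as the hash, a simple "next servers in the list" placement, hypothetical server names); Elastic Search's own routing hash and shard allocation logic are different and much smarter.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Pick a shard by hashing the document id.

    crc32 is a stand-in hash; unlike Python's built-in hash() it is stable across runs.
    """
    return zlib.crc32(str(doc_id).encode("utf-8")) % number_of_shards


def replica_nodes(primary_node, nodes, replicas):
    """Place replicas on the nodes that follow the primary in the list, never on the primary."""
    start = nodes.index(primary_node)
    others = [nodes[(start + offset) % len(nodes)] for offset in range(1, len(nodes))]
    return others[:replicas]


if __name__ == "__main__":
    servers = ["server1", "server2", "server3"]
    shard = shard_for("tweet-42", 5)
    primary = servers[shard % len(servers)]
    print("shard", shard, "primary on", primary, "replica on", replica_nodes(primary, servers, 1))
```

Since the same id always hashes to the same shard, any node can work out where a document lives, and since a replica never shares a server with its primary, one server going down loses no data.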
SCALING ISSUES?
• No need for an external load balancer
  • Since cluster does its own routing.
  • Ask any server in the cluster, it will delegate to correct node.
• What if …
  • More data > More shards.
  • More availability > More replicas per shard.
PERFORMANCE TWEAKING
• Bulk Indexing
• Multi-Get
  • Avoids network latency (HTTP API)
• API with administrative & monitoring interface
  • Cluster's availability state
  • Health
  • Nodes' memory footprint
• Alternatives for HTTP API?
  • Java library
  • PHP wrappers (Sherlock, Elastica, …)
  • But simplicity of HTTP API is brilliant to work with, latency is hardly an issue.
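Bulk indexing means sending many documents in one HTTP request instead of one request per document. A sketch of how such a request body could be built, assuming the newline-delimited format of the _bulk endpoint (one action line followed by one source line per document); the index "blog", the type "post" and the documents are hypothetical, and nothing is sent over the network here.

```python
import json


def bulk_body(index, doc_type, documents):
    """Build a newline-delimited body: an action line, then the document itself, per document."""
    lines = []
    for doc_id, source in documents.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The body has to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    posts = {
        "1": {"title": "Introduction to Elastic Search"},
        "2": {"title": "Smashing search"},
    }
    print(bulk_body("blog", "post", posts), end="")
```

One round trip for a thousand documents instead of a thousand round trips is where the latency win comes from.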
Query DSL Example:
(language:nl OR location.country:be OR location.country:aa) (tag:sentiment.negative) author.followers:[1000 TO *] (-sub_category:like) ((-status:857.assigned) (-status:857.done))
FACETS
• Instead of returning the matching documents …
• … return data about the distribution of values in the set of matching documents
  • Or a subset of the matching documents
• Possibilities:
  • Totals per unique value
  • Averages of values
  • Distributions of values
  • …
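What a facet computes is easy to show on a plain list of documents. This sketch only mimics the idea in memory; the field names and sample mentions are made up, and a real search engine computes this on the index instead of looping over documents.

```python
from collections import Counter


def terms_facet(documents, field):
    """Totals per unique value of one field, like GROUP BY field with COUNT(*)."""
    return Counter(doc[field] for doc in documents if field in doc)


def average_facet(documents, field):
    """Average of a numeric field over the documents that have it."""
    values = [doc[field] for doc in documents if field in doc]
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    mentions = [
        {"sentiment": "negative", "country": "be", "followers": 1200},
        {"sentiment": "positive", "country": "be", "followers": 300},
        {"sentiment": "negative", "country": "nl", "followers": 600},
    ]
    print(terms_facet(mentions, "sentiment"))
    # Facet on a filtered subset: only the mentions coming from Belgium.
    belgian = [mention for mention in mentions if mention["country"] == "be"]
    print(terms_facet(belgian, "sentiment"))
    print(average_facet(mentions, "followers"))
```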
TERMINOLOGY (CONT'D)
MySQL                                               Elastic Search
SELECT field, COUNT(*) FROM table GROUP BY field    Facet
ADVANCED FEATURES
• Nested documents (Child-Parent)
  • Like MySQL joins?
• Percolation Index
  • Store queries in Elastic
  • Send it documents
  • Get returned which queries match
• Index Warming
  • Register search queries that cause heavy load
  • New data added to index will be warmed
  • So next time query is executed: pre cached
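Percolation turns search upside down: the queries are stored and the documents are what you send in. A minimal in-memory sketch of that idea, where a "query" is just a list of words that must all occur in the document; the client names and queries are hypothetical and this is not how Elastic Search implements it.

```python
def matches(query_terms, document_text):
    """A stored query matches when all of its terms occur in the document."""
    words = set(document_text.lower().split())
    return all(term in words for term in query_terms)


def percolate(registered_queries, document_text):
    """Return the names of all stored queries that match the incoming document."""
    return sorted(
        name for name, terms in registered_queries.items() if matches(terms, document_text)
    )


if __name__ == "__main__":
    # Stored up front: one query per client.
    queries = {"client_a": ["coca", "cola"], "client_b": ["thirsty"]}
    # Then, for every incoming message, ask which clients should receive it.
    print(percolate(queries, "So thirsty for a Coca Cola"))
    print(percolate(queries, "thirsty again"))
```

That is the routing use case: every incoming document is checked once against all registered queries, and the result tells you which clients it belongs to.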
WHAT ARE MY OTHER OPTIONS?
• RDBMS
  • MySQL, …
• NoSQL
  • MongoDB, …
• Search Engines
  • Solr
  • Sphinx
  • Xapian
  • Lucene itself
• SaaS
  • Amazon CloudSearch

… VS. SOLR
• +
  • Also built on Lucene
    • So similar feature set
  • Also exposes Lucene functionality, like Elastic Search, so easy to extend.
  • A part of Apache Lucene project
  • Perfect for Single Server search
• -
  • Clustering is there. But it's definitely not as simple as ElasticSearch's
  • Fragmented code base. (Lots of branches.)
Engagor used to run on Solr.

… VS. SPHINX
• +
  • Great for single server full text searches;
  • Has graceful integration with SQL database;
    • (Eg. for indexing data)
  • Faster than the others for simple searches;
• -
  • No out of the box clustering;
  • Not built on Lucene; lacks some advanced features;
Netlog & Twoo use Sphinx.

WANT TO USE IT?
• In an existing project:
  • As an extra layer next to your data …
    • Send to both your database & elasticsearch;
    • Consistency problems?;
  • Or as replacement for database
    • Elastic is as persistent as MySQL;
    • If you don't need RDBMS features;
    • @Engagor: Our social messages are only in Elastic
“Users are incredibly bad at finding and researching things on the web.” Nielsen (March 2013) http://www.nngroup.com/articles/search-navigation/
“Pathetic and useless are words that come to mind after this year's user testing.” Nielsen (March 2013) http://www.nngroup.com/articles/search-navigation/