Indexing all the things: Building your search engine in Python
Joe Cabrera
@greedoshotlast
Hi, I’m Joe Cabrera
● Senior Backend Engineer at Jopwell
● Python programmer since 2009
● Building scalable search and backend systems for about 2 years
● Author of various open source Python projects
Our database setup
Trying to find Carmen Sandiego in SQL
● We could start by using LIKE with wildcards: ~91 sec / 1M records, low accuracy
    SELECT * FROM profile JOIN profile_location JOIN location WHERE
    first_name LIKE '%Carmen%' AND last_name LIKE '%Sandiego%'
● But wait, we could also use full-text search: ~8 min / 1M records, higher accuracy
    SELECT * FROM profile JOIN profile_location JOIN location WHERE
    (first_name || ' ' || last_name) @@ 'Carmen Sandiego'
Great, but...
● MySQL has very limited support for full-text search
● Custom features may not be supported if you are using Postgres RDS
● You start getting lots of long custom SQL queries
● We’re going to have to manage our own database sharding
Enter Elasticsearch
● Built on top of the Lucene search library
● Designed to be distributed
● Full-text indexing and search engine
● Features a common interface: JSON over HTTP
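Because the interface is JSON over HTTP, a search is an ordinary HTTP request with a JSON body. The following is a minimal sketch using only the standard library. The host `http://localhost:9200`, the index name, and the function name `build_search_request` are assumptions for illustration, and the request is built but never sent.

```python
import json
import urllib.request

def build_search_request(base_url, index, query_text, fields):
    """Build (but do not send) an HTTP request for an Elasticsearch search."""
    body = {"query": {"simple_query_string": {"query": query_text, "fields": fields}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Assumed local node and index name, for illustration only.
request = build_search_request("http://localhost:9200", "my-new-index",
                               "Carmen Sandiego", ["first_name", "last_name"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it to a running node. The Python client used in the rest of the talk wraps this same exchange.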
{
    "doc": {
        "first_name": "Carmen",
        "last_name": "Sandiego",
        "locations": [
            "New York",
            "London",
            "Tangier"
        ],
        "location_id": [
            1,
            2,
            3
        ]
    }
}
Flattening our documents
def index_single_doc(field_names, profile):
    # Copy each requested attribute of the profile into a flat dict
    index = {}
    for field_name in field_names:
        field_value = getattr(profile, field_name)
        index[field_name] = field_value
    return index
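As a quick check of the flattening helper, here is a self-contained sketch that runs it against a stand-in object. `SimpleNamespace` replaces the real model instance, and the `email` field and its value are made up for illustration.

```python
from types import SimpleNamespace

def index_single_doc(field_names, profile):
    """Copy the named attributes of a profile into a flat dict."""
    return {name: getattr(profile, name) for name in field_names}

# Stand-in for a Profile model instance (assumption: the real one is an ORM model).
profile = SimpleNamespace(id=1, first_name="Carmen", last_name="Sandiego", email="hidden")
doc = index_single_doc(["first_name", "last_name"], profile)
print(doc)  # {'first_name': 'Carmen', 'last_name': 'Sandiego'}
```

Only the fields named in `field_names` reach the document, so anything that should not be searchable stays out of the index.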
What about data in related tables?
location_names = []
location_ids = []
for p in profile.locations.all():
    location_names.append(str(p))
    location_ids.append(p.id)
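Putting the two steps together, a sketch of how the flat fields and the related locations might be combined into the document shape shown earlier. The dataclasses are stand-ins for the ORM models, `build_doc` is a hypothetical name, and `profile.locations` is a plain list here rather than a related manager with `.all()`.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Location:
    id: int
    name: str

    def __str__(self):
        return self.name

@dataclass
class Profile:
    first_name: str
    last_name: str
    locations: List[Location] = field(default_factory=list)

def build_doc(profile):
    """Flatten a profile and its related locations into one document."""
    return {
        "first_name": profile.first_name,
        "last_name": profile.last_name,
        "locations": [str(loc) for loc in profile.locations],
        "location_id": [loc.id for loc in profile.locations],
    }

carmen = Profile("Carmen", "Sandiego",
                 [Location(1, "New York"), Location(2, "London"), Location(3, "Tangier")])
doc = build_doc(carmen)
print(doc)
```

The two lists stay index-aligned, so `locations[i]` always names the row whose id is `location_id[i]`.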
Indexing our document into Elasticsearch
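from elasticsearch import Elasticsearch
def add_doc(self, data, doc_id):
    es_instance = Elasticsearch('https://my_elasticsearchserver')
    # refresh=True makes the document searchable as soon as the call returns
    es_instance.index(index='my-new-index', doc_type='db-text', id=doc_id,
                      body=data, refresh=True)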
Getting the data back out of Elasticsearch
● First we run our query against Elasticsearch
● Then we grab the doc ids from the search results
● Finally we use the doc ids to load the profiles from our database for the final search response
Performing our query
query_json = {'query': {'simple_query_string': {'query': 'Carmen Sandiego',
                                                'fields': ['first_name', 'last_name']}}}
es_results = es_instance.search(index=self.index,
                                body=query_json,
                                size=limit,
                                from_=offset)
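The query above takes `limit` and `offset` from somewhere outside the slide. A minimal sketch of how they might be derived from a page number, assuming 1-based pages; `build_query` and `page_to_window` are hypothetical helper names.

```python
def build_query(text, fields):
    """Build the simple_query_string body used in the search call."""
    return {"query": {"simple_query_string": {"query": text, "fields": list(fields)}}}

def page_to_window(page, per_page):
    """Translate a 1-based page number into (size, from_) values."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    return per_page, (page - 1) * per_page

query_json = build_query("Carmen Sandiego", ["first_name", "last_name"])
limit, offset = page_to_window(page=3, per_page=20)
print(limit, offset)  # 20 40
```

The resulting `limit` and `offset` are what the search call passes as `size` and `from_`.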
{
    "took": 63,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": null,
        "hits": [
            {
                "_index": "my-new-index",
                "_type": "db-text",
                "_id": "1",
                "sort": [0],
                "_score": null,
                "_source": {"first_name": "Carmen", "last_name": "Sandiego",
                            "locations": ["New York", "London", "Tangier"],
                            "location_id": [1, 2, 3]}
            }
        ]
    }
}
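The doc ids sit under `hits.hits[]._id` in that response. A minimal sketch of pulling them out; `extract_ids` is a hypothetical helper name, and the response dict is a trimmed copy of the one above.

```python
def extract_ids(es_results):
    """Return the _id of every hit, in the order Elasticsearch ranked them."""
    return [hit["_id"] for hit in es_results.get("hits", {}).get("hits", [])]

es_results = {
    "took": 63,
    "timed_out": False,
    "hits": {
        "total": 1,
        "max_score": None,
        "hits": [{"_index": "my-new-index", "_type": "db-text", "_id": "1",
                  "_score": None,
                  "_source": {"first_name": "Carmen", "last_name": "Sandiego"}}],
    },
}
raw_ids = extract_ids(es_results)
print(raw_ids)  # ['1']
```

Note that `_id` comes back as a string, even when the document was indexed with an integer id.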
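Populating the search results
def load_profiles(raw_ids):  # wrapper name is illustrative; the slide shows only the body
    search_results = []
    for _id in raw_ids:
        try:
            search_results.append(Profile.objects.get(pk=_id))
        except Profile.DoesNotExist:
            # The index can lag behind the database, so skip ids with no matching row
            pass
    return search_results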
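The loop above issues one database query per id. One possible variation, not shown in the talk, is to fetch all rows in a single query and then reorder them to match the search ranking. In Django, `Profile.objects.in_bulk(ids)` returns a dict keyed by primary key; the sketch below stands in for that with a plain dict so it runs without a database, and it assumes integer primary keys.

```python
def order_by_rank(raw_ids, rows_by_pk):
    """Return rows in search-rank order, skipping ids with no matching row."""
    results = []
    for raw_id in raw_ids:
        row = rows_by_pk.get(int(raw_id))  # Elasticsearch returns _id as a string
        if row is not None:
            results.append(row)
    return results

# Stand-in for the dict an ORM bulk lookup would return; values would be model instances.
rows_by_pk = {1: "Carmen Sandiego", 7: "Zack", 9: "Ivy"}
ranked = order_by_rank(["9", "1", "42"], rows_by_pk)
print(ranked)  # ['Ivy', 'Carmen Sandiego']
```

Reordering matters because a bulk lookup returns rows keyed by id, not in relevance order.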
How do we make Elasticsearch production ready?
Using Celery to distribute the task of indexing
● Celery is a distributed task queue
● Indexing is a memory-bound task, so we don’t want it tying up server resources
● For the initial load, we’ll index each document in its own task, with all of those tasks dispatched by one larger master task
● New documents can be added incrementally to the existing index by firing off a separate task
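from celery import group, shared_task  # shared_task replaces the older `task` import

@shared_task
def index_all_docs():
    ...
    # .si() builds an immutable signature: each subtask runs with exactly these arguments
    group(process_doc.si(profile_id) for profile_id in profile_ids)()

@shared_task
def process_doc(profile_id):
    ...  # body not shown on the slide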
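The fan-out pattern above can be tried without a broker. The sketch below swaps Celery for `concurrent.futures.ThreadPoolExecutor`, purely as a stand-in: a master function submits one small job per profile id and collects the results. The function bodies and the ids are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def process_doc(profile_id):
    """Stand-in for the per-document task: pretend to index one profile."""
    return {"id": profile_id, "indexed": True}

def index_all_docs(profile_ids, max_workers=4):
    """Stand-in for the master task: fan out one job per profile id."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as profile_ids
        return list(pool.map(process_doc, profile_ids))

results = index_all_docs([1, 2, 3])
print(results)
```

With Celery the jobs would run on separate worker processes or machines, which is what keeps indexing from tying up the web servers.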
How do we keep these datastores in sync?
Syncing data to Elasticsearch
import celery

def save(self, *args, **kwargs):
    super(Profile, self).save(*args, **kwargs)
    # Queue a reindex of this profile once the row has been saved
    celery.current_app.send_task('search_indexer.add_doc', args=(self.id,))
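The same save-then-enqueue pattern, as a sketch that runs without Django or Celery. `TaskQueue` is a stand-in that records tasks instead of sending them, and the id assignment imitates what the database would do; both are assumptions for illustration.

```python
class TaskQueue:
    """Stand-in for a Celery app: records tasks instead of sending them."""

    def __init__(self):
        self.sent = []

    def send_task(self, name, args=()):
        self.sent.append((name, tuple(args)))

queue = TaskQueue()

class Profile:
    _next_id = 1

    def __init__(self, first_name, last_name):
        self.id = None
        self.first_name = first_name
        self.last_name = last_name

    def save(self):
        # Persist first (here: just assign an id), then queue the reindex,
        # so the task is always queued with a real id.
        if self.id is None:
            self.id = Profile._next_id
            Profile._next_id += 1
        queue.send_task('search_indexer.add_doc', args=(self.id,))

p = Profile("Carmen", "Sandiego")
p.save()
print(queue.sent)  # [('search_indexer.add_doc', (1,))]
```

Only the id travels through the queue; the worker is expected to load the current row itself, so it indexes whatever state the database holds when the task runs.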
Great, but what about partial updates?
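import json
from elasticsearch import Elasticsearch

def update_doc(self, doc_id, data):
    es_instance = Elasticsearch('https://my_elasticsearchserver')
    # Only the fields inside 'doc' are changed; data is expected to be a JSON string
    es_instance.update(index='my-new-index', doc_type='db-text', id=doc_id,
                       body={'doc': json.loads(data)}, refresh=True)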
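A partial update sends only the changed fields under a `doc` key, and Elasticsearch merges them into the stored document. The sketch below imitates that merge locally for a flat document. It illustrates the effect only and is not the server's implementation; `apply_partial_update` is a hypothetical name.

```python
import json

def apply_partial_update(stored_doc, data):
    """Imitate a partial update on a flat document: fields in the update win."""
    body = {"doc": json.loads(data)}  # same request body shape as update_doc
    merged = dict(stored_doc)         # copy, so the stored document is untouched
    merged.update(body["doc"])
    return body, merged

stored = {"first_name": "Carmen", "last_name": "Sandiego", "locations": ["New York"]}
body, merged = apply_partial_update(stored, '{"locations": ["New York", "Cairo"]}')
print(merged)
```

Fields missing from the update keep their stored values, while an array such as `locations` is replaced as a whole rather than appended to.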
Resources
● Code examples from today - http://bit.ly/python-search
● Elasticsearch-py - https://github.com/elastic/elasticsearch-py
● Elasticsearch official docs - https://www.elastic.co/guide/index.html
● Celery - https://github.com/celery/celery/
Thank you!
Watch @greedoshotlast for these slides
