MIGRATING A 130TB CLUSTER FROM
ELASTICSEARCH 2 TO 5 IN 20 HOURS WITHOUT
DOWNTIME
FRED DE VILLAMIL
@FDEVILLAMIL
OCTOBER 2017
ABOUT ME
FRED DE VILLAMIL, FORMER DIRECTOR OF INFRASTRUCTURE
@SYNTHESIO
FIRST ELASTICSEARCH IN PRODUCTION WAS 0.17.6
LINUX / (FREE)BSD USER SINCE 1996,
OPEN SOURCE CONTRIBUTOR SINCE 1998,
LOVES COOL TECHS, TENNIS, PHOTOGRAPHY, CUTE OTTERS,
INAPPROPRIATE HUMOR AND ELASTICSEARCH CLUSTERS OF UNUSUAL
SIZE.
WRITES ABOUT ES & MORE AT HTTPS://THOUGHTS.T37.NET
ABOUT SYNTHESIO
SYNTHESIO IS THE LEADING SOCIAL INTELLIGENCE TOOL FOR
SOCIAL MEDIA MONITORING & SOCIAL ANALYTICS
SYNTHESIO CRAWLS THE WEB FOR RELEVANT DATA, ENRICHES
IT WITH SENTIMENT ANALYSIS AND DEMOGRAPHICS TO BUILD
SOCIAL ANALYTICS DASHBOARDS.
ELASTICSEARCH @SYNTHESIO
8 production clusters:
• +600 hosts, all bare metal
• 3 data centers
• 1.7PB storage SSD / NVME
• 37.5TB RAM
Hardware:
• 6-core Xeon E5 v3 or dual 12-core
Xeon E5-2687W v4 (160 watts!!!)
• 64GB to 256GB RAM
• 4 x 800GB SSD / 2 x 1.2TB NVME
• RAID0 everywhere
We aggregate data from various cold
storage and make it searchable in a
jiffy.
Average cluster stats
• writes: 85k documents / second, 1.5M
at peak
• 800 searches / second, with some clusters
sustaining a continuous 25k searches /
second
• Doc size from 150KB to 200MB
THE BLACKHOLE CLUSTER
Topology
• 68 data nodes
• 3 master nodes
• 6 ingest nodes
• 200TB storage SSD
• 2.4TB heap
• 924 cores
Cluster stats:
• 1137 indices (daily)
• 27266 shards
• 130TB data
• 201 billion documents
• 7000 new documents / second
• 800 searches / second on the whole dataset
FEEDING BLACKHOLE FOR FUN AND PROFIT
BLACKHOLE ALLOCATION SETTINGS
"cluster.routing.allocation.node_initial_primaries_recoveries": 50
"cluster.routing.allocation.node_concurrent_recoveries": 20
"indices.recovery.max_bytes_per_sec": "2048mb"
"indices.recovery.concurrent_streams": "30"
"cluster.routing.allocation.disk.threshold_enabled": true
"cluster.routing.allocation.disk.watermark.low": "78%"
"cluster.routing.allocation.disk.watermark.high": "79%"
"cluster.routing.rebalance.enable": "all"
"cluster.routing.allocation.cluster_concurrent_rebalance": 50
"cluster.routing.allocation.allow_rebalance": "always"
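
A minimal sketch of how the settings on this slide could be packaged for the cluster update settings API (`PUT /_cluster/settings`). It assumes they were applied as transient settings, which the deck does not state. The helper name `build_cluster_settings` is hypothetical, and the code only builds the JSON body without contacting a cluster.

```python
import json

# Values copied from the slide; Elasticsearch setting names are lowercase.
BLACKHOLE_ALLOCATION = {
    "cluster.routing.allocation.node_initial_primaries_recoveries": 50,
    "cluster.routing.allocation.node_concurrent_recoveries": 20,
    "indices.recovery.max_bytes_per_sec": "2048mb",
    "indices.recovery.concurrent_streams": "30",
    "cluster.routing.allocation.disk.threshold_enabled": True,
    "cluster.routing.allocation.disk.watermark.low": "78%",
    "cluster.routing.allocation.disk.watermark.high": "79%",
    "cluster.routing.rebalance.enable": "all",
    "cluster.routing.allocation.cluster_concurrent_rebalance": 50,
    "cluster.routing.allocation.allow_rebalance": "always",
}


def build_cluster_settings(settings, persistent=False):
    """Wrap flat settings in the body shape used by PUT /_cluster/settings.

    Assumption: transient scope by default; the slide does not say which
    scope was used.
    """
    scope = "persistent" if persistent else "transient"
    return {scope: dict(settings)}


if __name__ == "__main__":
    # Print the body that would be sent to the cluster.
    print(json.dumps(build_cluster_settings(BLACKHOLE_ALLOCATION), indent=2))
```

The same helper would work for the later slide that raises the recovery limits and watermarks, by passing only the keys that change.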
USING THE REINDEX API?
REINDEX API:
• NO SLICED SCROLL UNTIL ES
6.0
• SLOW
• MIGHT LOSE SOME DOCUMENTS,
NEEDS LOTS OF ERROR CONTROL
LOGSTASH:
• NO SLICED SCROLLS UNTIL ES
6.0
• FASTER THAN THE REINDEX API
• REALLY DOESN’T LIKE ERRORS
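
To illustrate the kind of error control mentioned above, here is a minimal sketch that inspects a bulk API response and lists the documents that were rejected, so they can be retried instead of silently lost. The function name `failed_bulk_items` and the sample index name are hypothetical. The response shape (`errors`, `items`, per-item `status`) follows the bulk API, but this is not the tooling used for the migration.

```python
def failed_bulk_items(bulk_response):
    """Return (doc_id, status, error) for every item a bulk request rejected."""
    failures = []
    # The top-level "errors" flag is true when at least one item failed.
    if not bulk_response.get("errors"):
        return failures
    for item in bulk_response.get("items", []):
        # Each item has a single key naming the action (index, create, ...).
        for result in item.values():
            status = result.get("status", 0)
            if status >= 300:
                failures.append((result.get("_id"), status, result.get("error")))
    return failures


# Hypothetical response: one document indexed, one rejected by a full queue.
sample_response = {
    "took": 30,
    "errors": True,
    "items": [
        {"index": {"_index": "blackhole-2017.01.01", "_id": "1", "status": 201}},
        {"index": {"_index": "blackhole-2017.01.01", "_id": "2", "status": 429,
                   "error": {"type": "es_rejected_execution_exception"}}},
    ],
}

if __name__ == "__main__":
    for doc_id, status, error in failed_bulk_items(sample_response):
        print(doc_id, status, error)
```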
BEFORE UPGRADING
• USE THE UPGRADE CHECK PLUGIN TO VALIDATE THAT CURRENT INDEXES
ARE COMPATIBLE
• UPGRADE YOUR MAPPING TEMPLATES TO BE ES 5 COMPLIANT
• CREATE THE INDEXES FOR THE NEXT 10 DAYS (JUST IN CASE)
• TELL YOUR HOSTING PROVIDER YOU’RE GOING TO TRANSFER 130TB
IN 17 HOURS
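
A quick back-of-the-envelope calculation shows why the hosting provider needs a warning. Assuming decimal terabytes and a perfectly even transfer rate (both simplifications), moving 130TB in 17 hours requires a sustained rate of roughly 2.1 GB/s, or about 17 Gbit/s.

```python
TB = 10**12  # assumption: decimal terabytes


def required_throughput(total_bytes, hours):
    """Return (bytes per second, gigabits per second) for an even transfer."""
    seconds = hours * 3600
    bytes_per_sec = total_bytes / seconds
    gbit_per_sec = bytes_per_sec * 8 / 1e9
    return bytes_per_sec, gbit_per_sec


bytes_per_sec, gbit_per_sec = required_throughput(130 * TB, 17)

if __name__ == "__main__":
    print(f"{bytes_per_sec / 1e9:.2f} GB/s, {gbit_per_sec:.1f} Gbit/s")
```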
EXPANDING BLACKHOLE
OPS:
• +90 NEW SERVERS IN 2 NEW RACKS
• RAISED THE REPLICATION FACTOR TO 3
RESULT:
• 167 NODES
• 53626 SHARDS
• 279TB DATA
• 391TB STORAGE
• 5.42TB HEAP
• 2004 CORES
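
A minimal sketch of the request body for changing the replica count through the update index settings API (`PUT /<index>/_settings`). The helper name `replica_settings` is hypothetical, and the slide does not say whether "replication factor 3" means `number_of_replicas: 3` or three copies in total, so the value is left as a parameter.

```python
import json


def replica_settings(number_of_replicas):
    """Build the body for PUT /<index>/_settings that sets the replica count."""
    if number_of_replicas < 0:
        raise ValueError("number_of_replicas must be >= 0")
    return {"index": {"number_of_replicas": number_of_replicas}}


if __name__ == "__main__":
    # Assumption: 3 is passed straight through as number_of_replicas.
    print(json.dumps(replica_settings(3)))
```

The same body with a value of 1 would cover the later step where the number of replicas is switched back to 1.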
SETTINGS UPDATE DURING THE REPLICA INIT
"indices.recovery.max_bytes_per_sec": "4096mb"
"indices.recovery.concurrent_streams": "50"
"cluster.routing.allocation.disk.watermark.low": "98%"
"cluster.routing.allocation.disk.watermark.high": "99%"
"cluster.routing.rebalance.enable": "none"
PROBLEMS
• THE TRANSFER BROUGHT THE
WHOLE CLUSTER TO
ITS KNEES.
• THIS SLOWED DOWN THE
WRITES.
• THE BULK THREAD POOL
STARTED TO FILL UP.
SOLUTION: ZONING FOR FUN & PROFIT
• ALLOCATE THE FRESHEST DATA AND
ONGOING WRITES TO ONE ZONE
• SEGREGATE EVERYTHING ELSE IN A
DIFFERENT ZONE
• WAIT FOR THE CLUSTER TO CALM
DOWN
• TOTAL TIME SPENT ON THE
TRANSFER: 17 HOURS
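
The zoning idea above can be sketched with shard allocation filtering. This is an illustration under several assumptions, not the deck's actual configuration: indices are daily and named with a hypothetical `blackhole-YYYY.MM.DD` pattern, nodes carry a hypothetical custom attribute `zone` with values `fresh` and `old`, and the `index.routing.allocation.require.<attribute>` index setting pins each index to one zone. The deck does not say which mechanism was used.

```python
from datetime import date, datetime, timedelta

PREFIX = "blackhole-"  # hypothetical index name prefix


def zone_for_index(index_name, today, fresh_days=2):
    """Pick a zone from the date embedded in a daily index name."""
    day = datetime.strptime(index_name[len(PREFIX):], "%Y.%m.%d").date()
    # Recent and pre-created future indices receive writes: keep them apart.
    return "fresh" if today - day <= timedelta(days=fresh_days) else "old"


def zoning_settings(index_names, today, fresh_days=2):
    """Map each index to the body for PUT /<index>/_settings."""
    return {
        name: {"index.routing.allocation.require.zone":
               zone_for_index(name, today, fresh_days)}
        for name in index_names
    }


if __name__ == "__main__":
    names = ["blackhole-2017.01.01", "blackhole-2017.01.10"]
    for name, body in zoning_settings(names, date(2017, 1, 10)).items():
        print(name, body)
```

Keeping the write-heavy indices on their own nodes means recovery traffic for older data does not compete with bulk indexing, which is the effect the slide describes.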
SPLITTING THE CLUSTER IN 2
• SET
"cluster.routing.allocation.enable"
TO "all"
• SHUT DOWN 2 OF THE RACKS
• SHUT DOWN ONE OF THE
MASTERS
• SWITCH THE NUMBER OF
REPLICAS TO 1
BUILDING BLACKHOLE02
• RECONFIGURE THE 2 SHUTDOWN RACKS AND MASTER SO
THEY TALK TO EACH OTHER
• START THE MASTER, ALONE, CLOSE THE INDEXES
• UPGRADE THE MASTER TO ES 5.1.1
• UPGRADE ALL THE PLUGINS
• START THE MASTER: THE WHOLE UPGRADE TOOK 32 SECONDS
BRINGING BACK THE DATA
• UPGRADE ES AND THE PLUGINS ON THE DATA NODES
• START ELASTICSEARCH
• WAIT 30 MINUTES FOR THE CLUSTER TO GO BACK TO GREEN
• PLUG A WORK UNIT TO CATCH UP WITH THE PAST 18 HOURS
OF DATA
• UPDATE THE LOAD BALANCER CONFIGURATION TO USE THE
NEWLY UPGRADED CLUSTER
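
The "wait for green" step can be sketched as a small polling loop. The function name `wait_for_green` is hypothetical, and `fetch_health` is a caller-supplied callable assumed to return the parsed JSON of `GET /_cluster/health`, whose `status` field is `green`, `yellow` or `red`. The default timeout of 1800 seconds only mirrors the 30 minutes quoted on the slide.

```python
import time


def wait_for_green(fetch_health, timeout_s=1800, interval_s=10,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll cluster health until it is green; return False on timeout."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_health().get("status")
        if status == "green":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Fake health source standing in for a real HTTP call.
    fake = iter([{"status": "red"}, {"status": "yellow"}, {"status": "green"}])
    print(wait_for_green(lambda: next(fake), interval_s=0))
```

Only once this returns True would it make sense to start the catch-up work unit and switch the load balancer, in the order listed above.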
TIMELINE
QUESTIONS?
@FDEVILLAMIL
