FEBRUARY 9, 2017, WARSAW
That won’t fit into RAM
Michał Brzezicki – Founder of SentiOne
SentiOne - Powerful online monitoring for teams and enterprises
• Social Listening
• Research and analytics
• Social CRM
Just some of our clients
Technology behind
…and even more
Deployed on…
• 201 dedicated servers at OVH
  • OVH (77 servers)
  • SoYouStart (14 servers)
  • Kimsufi (110 servers)
• 25 virtual cloud based machines using Microsoft Azure
Brief architecture description
• Crawling servers gathering data from WWW and various APIs
• Intermediate analysing layer
• Database
• Web application with GUI for analysing data
Key challenges
In efficient web crawling and data extraction
Gathering data
• WWW
• Millions of domains
• Each HTML site is different
• Thousands of date formats
• Searching for new domains
• Only a small portion of the data is interesting
• Content, date, author, keywords
• Proprietary algorithm for data extraction
• Social Media APIs
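One common way to cope with many date formats is to try a list of candidate patterns in order. The sketch below is only an illustration of that idea, not the proprietary extraction algorithm mentioned above; the format list and the `parse_date` helper are hypothetical.

```python
from datetime import datetime
from typing import Optional

# Hypothetical sample of formats; a real crawler would need far more.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%d.%m.%Y %H:%M",
    "%d/%m/%Y",
    "%B %d, %Y",
]


def parse_date(text: str) -> Optional[datetime]:
    """Return the first successful parse, or None if no format matches."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this pattern did not fit, try the next one
    return None
```

Ordering matters here: ambiguous inputs are claimed by whichever pattern comes first, so per-domain format lists are usually safer than one global list.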
Sites we want to crawl
• Unvisited new sites with new content (e.g. new articles)
• Sites with links to sites with new content (e.g. forum thread lists)
• Known sites with fresh content (hot articles with many comments)
Crawling strategy
• Minimize the time between publication of a post and its discovery by the crawler
• Maximize number of extracted posts
• Minimize website hits
Crawling queue
• The longer the crawler works, the better decisions it makes
• 500 000+ monitored domains
• Enormous website graphs
• Sorted queue in RAM
• Limit on depth
• Unsorted list of URLs in HBase
• MapReduce jobs create the jobs for the crawler
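A sorted in-memory queue with a depth limit could look roughly like the sketch below. The talk does not describe the actual data structure or scoring, so the `CrawlQueue` class, the `score` argument and the deduplication set are all assumptions made for illustration.

```python
import heapq
import itertools
from typing import List, Optional, Set, Tuple


class CrawlQueue:
    """Sorted in-memory queue of URLs with a limit on link depth."""

    def __init__(self, max_depth: int) -> None:
        self.max_depth = max_depth
        self._heap: List[Tuple[float, int, str, int]] = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order
        self._seen: Set[str] = set()

    def push(self, url: str, depth: int, score: float) -> bool:
        """Queue a URL; higher score means crawl sooner. False if skipped."""
        if depth > self.max_depth or url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-score, next(self._counter), url, depth))
        return True

    def pop(self) -> Optional[Tuple[str, int]]:
        """Return the (url, depth) pair with the highest score, or None."""
        if not self._heap:
            return None
        _, _, url, depth = heapq.heappop(self._heap)
        return url, depth
```

In this sketch only the sorted frontier lives in RAM; the full unsorted URL list would stay in external storage, as the slide describes for HBase.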
Lingering problems
• Generated traffic
• Proper detection of website size
• Kindness parameter – robots.txt crawl-delay
• Sites with bad HTML
• Dynamic websites
• Monitoring and storing performance statistics
• USER AGENT: SentiBot www.sentibot.eu (compatible with Googlebot)
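Reading the crawl-delay from robots.txt can be sketched with Python's standard `urllib.robotparser`. This is an illustration only, not SentiBot's implementation; the fallback delay and the sample robots.txt content are assumptions.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY_SECONDS = 1.0  # assumed fallback, not a value from the talk


def crawl_delay_for(robots_txt: str, user_agent: str) -> float:
    """Return the Crawl-delay for this agent, or the fallback if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


# Hypothetical robots.txt content used only for this example.
SAMPLE_ROBOTS = """\
User-agent: SentiBot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow:
"""
```

A crawler would then wait at least `crawl_delay_for(...)` seconds between two requests to the same host.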
Research project
• Optimizing crawling is part of a research grant carried out in collaboration with Gdańsk University of Technology
• Funded by The National Centre for Research and Development
Data cluster
Dos and don’ts
ElasticSearch Tech specs
• 1 document type
• 28TB with replication
• 14.3B documents
• 848 shards
• 78 indices (partitioned by date and language)
• 48 nodes
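A back-of-the-envelope check of these figures, as a sketch only: the slide does not say whether the shard and document counts include replicas, so the averages below simply divide the stated totals, with 1 TB taken as 1000 GB.

```python
# Figures as stated on the slide.
TOTAL_TB = 28
DOCUMENTS = 14.3e9
SHARDS = 848
NODES = 48

gb_per_shard = TOTAL_TB * 1000 / SHARDS   # average shard size in GB
docs_per_shard = DOCUMENTS / SHARDS       # average documents per shard
shards_per_node = SHARDS / NODES          # average shards hosted per node
gb_per_node = TOTAL_TB * 1000 / NODES     # average data per node in GB

print(f"{gb_per_shard:.1f} GB/shard, {docs_per_shard / 1e6:.1f}M docs/shard")
print(f"{shards_per_node:.1f} shards/node, {gb_per_node:.0f} GB/node")
```

That works out to roughly 33 GB and 17 million documents per shard, and about 18 shards per node on average.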
Load
• 2k search requests per second
• 1k index requests per second
• Approximately 75% of document inserts are duplicates
• Up to 45M new documents daily
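With so many duplicate inserts, one common approach is to derive a deterministic document ID from the post itself, so that re-crawled posts map to the same document instead of creating a new one. The talk does not say how SentiOne detects duplicates; the field names and both helpers below are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Set, Tuple


def document_id(url: str, author: str, published: str, content: str) -> str:
    """Derive a stable ID so the same post always maps to the same document."""
    parts = (url, author, published, content)
    normalized = "\x1f".join(part.strip() for part in parts)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def filter_new(posts: Iterable[Dict[str, str]],
               known_ids: Set[str]) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (doc_id, post) only for posts whose ID has not been seen yet."""
    for post in posts:
        doc_id = document_id(post["url"], post["author"],
                             post["published"], post["content"])
        if doc_id not in known_ids:
            known_ids.add(doc_id)
            yield doc_id, post
```

Using the hash as the document ID also makes a repeated index request overwrite the existing document rather than add a copy.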
Configuration of each node
• 2 or 3 SSD disks in software RAID 0
• CPU 4 or 8 physical cores (Intel Xeon D 1540/1520)
• 64GB RAM
• 30GB of heap for JVM running Elastic due to pointer compression
• Half of RAM is used by the OS filesystem cache (do not give the JVM all of the RAM)
• Scaling performance by adding nodes to cluster
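The heap rule from this slide can be written as a small calculation: give the JVM at most half of the RAM and stay under the compressed-pointer threshold. This is a sketch of the rule of thumb as stated in the talk; the `recommended_heap_gb` helper and its constant name are made up for illustration.

```python
HEAP_CAP_GB = 30  # value used in the talk; the threshold it cites is about 31 GB


def recommended_heap_gb(ram_gb: int) -> int:
    """Half of RAM for the JVM, capped so compressed pointers stay enabled."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return min(ram_gb // 2, HEAP_CAP_GB)
```

For the 64 GB nodes described here this gives a 30 GB heap, and the result stays at 30 GB for larger machines.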
Why ElasticSearch?
• Solr was single-node
• Its cloud version (SolrCloud) was still unstable at that time
• ye good ol’ Lucene
• Free and ready to use
• Fair enough documentation
• Automatic rebalancing and redundancy
• Develops quickly, strong community
Main flaws of ElasticSearch
• Very fragile to network issues!
• Split brain
• Upgrade often requires reindexing data
• Schema changes not allowed without reindexing (only adding new fields)
• Think twice before setting mappings/tokenizers/analyzers
• Make sure you monitor and store every cluster property (e.g. with Grafana)
• No out-of-the-box security (changed in the newest 5.x)
• Heavy queries may cause cluster failure
• Problems with memory leaks
• Load balancing is not perfect, some nodes/shards can be more ‘hot’
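Split brain is usually mitigated by requiring a quorum of master-eligible nodes before a master can be elected. In Elasticsearch versions before 7.0 this is the `discovery.zen.minimum_master_nodes` setting. The talk does not say how SentiOne configured it, so the helper below only shows the general majority rule.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Strict majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1
```

With three dedicated masters the quorum is two, so a single isolated node cannot elect itself.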
Recommended Plugins
• Kopf – better than head
• Stempel – stemming for Polish
• ICU – transliteration for Greek and Cyrillic alphabets
• Worddelimiter2 – can handle more cases
• Decompound – Very useful in German
• Combo – combining multiple analyzers
• extended-analyze – debugging
• inquisitor – quick debugging
• Migration – checking what needs to be changed before migration
Ideas for the future
• Upgrade to newest version
• Move nodes to VLAN (vRack)
• Split master/data/client nodes
Cassandra
• 1:1 relationship with data in ElasticSearch
• Used to store volatile metadata (number of likes, comments, views etc.)
• Deployed on 8 nodes, 4TB of data
• Searching only by key fields
• Optimized for environments with more writes than reads
Things that didn’t work
Shame! Shame! Shame!
Scaling ES cluster in two locations
• Cluster distributed between Poland and France
• Network instability
Not setting proper heap size for ElasticSearch
• Heap related problems? Increase heap size? No!
• More memory is not always better
• Keep below 31GB to have smaller pointers
Not monitoring every variable of ElasticSearch
• All status variables should be stored for comparison in case of downtime
• Default plugins show only current status
• Compare ES data with OS variables (e.g. load, io_wait, network)
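Keeping history can be as simple as appending timestamped snapshots that pair cluster metrics with OS metrics. The sketch below is a minimal illustration under stated assumptions: the two fetch callables are hypothetical stand-ins (in practice one might read the cluster health endpoint and the other the system load), and the JSON-lines output is just one possible store.

```python
import io
import json
import time
from typing import IO, Callable, Dict


def record_snapshot(fetch_cluster: Callable[[], Dict],
                    fetch_os: Callable[[], Dict],
                    out: IO[str]) -> Dict:
    """Append one timestamped row combining ES and OS metrics as a JSON line."""
    row = {"ts": time.time(), "es": fetch_cluster(), "os": fetch_os()}
    out.write(json.dumps(row, sort_keys=True) + "\n")
    return row


# Usage with stubbed metric sources and an in-memory sink.
buffer = io.StringIO()
row = record_snapshot(lambda: {"status": "green", "relocating_shards": 0},
                      lambda: {"load1": 0.5, "io_wait": 0.02},
                      buffer)
```

Because every row carries a timestamp, the state just before an outage can be compared with a normal day.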
Not partitioning ES indices
• The less data you query the better
• Most calls are for newest data
• Gather statistics on queries and partition indices accordingly!
• No easy metric to choose shard size/number
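Partitioning by date and language lets a query touch only the indices that cover its time range. The sketch below assumes monthly indices and a `posts-<language>-<year>-<month>` naming scheme; both are hypothetical, since the talk only says that indices are partitioned by date and language.

```python
from datetime import date
from typing import List


def index_name(language: str, day: date) -> str:
    """Name of the (assumed monthly) index holding posts for this day."""
    return f"posts-{language}-{day.year:04d}-{day.month:02d}"


def indices_for_range(language: str, start: date, end: date) -> List[str]:
    """List every monthly index that overlaps the inclusive date range."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(index_name(language, date(year, month, 1)))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names
```

A query for the newest data then hits one or two small indices instead of the whole cluster.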
Scaling ES cluster vertically
• No point in using servers with more than 64GB RAM
• Smaller number of nodes is easier to manage only in theory
• High IO on single node
Not setting proper compaction strategy in Cassandra
• Changing from Size-Tiered to Leveled Compaction
• Data was never compacted
• Adding new nodes to cluster just to finish compaction
• Cluster downtime due to disk space problems
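Compaction writes the merged SSTables to new files before the old ones are removed, so a node temporarily needs free space roughly equal to the data being compacted. A simple headroom check could look like the sketch below. The function name and the numbers in the usage note are illustrative assumptions, not figures from the talk.

```python
def can_compact(total_gb: float, used_gb: float, compaction_input_gb: float) -> bool:
    """True if free space covers a full rewrite of the SSTables being compacted."""
    if used_gb > total_gb:
        raise ValueError("used space cannot exceed total space")
    free_gb = total_gb - used_gb
    return free_gb >= compaction_input_gb
```

For example, a 1000 GB disk with 600 GB used can absorb a 300 GB compaction, while the same disk with 900 GB used cannot.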
Abandoned technologies
• H2 Server – replaced with MariaDB, replaced with Percona
• Solr – replaced with ElasticSearch
• ActiveMQ – soon!
• Google Search API – replaced with YaCy
• Java -> Scala
• Maven -> SBT
• Grails -> Play!
• Vaadin -> ReactJS
Overhyped technologies
• Hadoop is cool… but not so long ago it was…
• Poorly documented
• Expensive to run, requires multiple nodes
• Overcomplicated
• Unstable
Thank you
Michał Brzezicki
michal@sentione.com
https://pl.linkedin.com/in/brzezicki
+48 603 926 001
