That won’t fit into RAM - Michał Brzezicki

FEBRUARY 9, 2017, WARSAW
That won’t fit into RAM
Michał Brzezicki – Founder of SentiOne

SentiOne - Powerful online monitoring for teams and enterprises
• Social Listening
• Research and analytics
• Social CRM

Just some of our clients

Technology behind

…and even more

Deployed on…
• 201 dedicated servers at OVH
  • OVH (77 servers)
  • SoYouStart (14 servers)
  • Kimsufi (110 servers)
• 25 virtual cloud-based machines on Microsoft Azure

Brief architecture description
• Crawling servers gathering data from WWW and various APIs
• Intermediate analysing layer
• Database
• Web application with GUI for analysing data

Key challenges
In efficient web crawling and data extraction

Gathering data
• WWW
• Millions of domains
• Each HTML site is different
• Thousands of date formats
• Searching for new domains
• Only a small portion of the data is interesting
• Content, date, author, keywords
• Proprietary algorithm for data extraction
• Social Media APIs
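
To illustrate the "thousands of date formats" problem, here is a minimal sketch of the simplest possible approach: try a list of known formats in order until one matches. This is not SentiOne's proprietary extraction algorithm; the format list and the `parse_date` helper are hypothetical and cover only a tiny sample.

```python
from datetime import datetime
from typing import Optional

# Hypothetical sample of formats; a real crawler would need far more,
# plus per-language month names and relative dates ("2 hours ago").
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%d.%m.%Y %H:%M",
    "%d/%m/%Y",
]


def parse_date(raw: str) -> Optional[datetime]:
    """Return the first successful parse, or None if no format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    return None
```

The order of the list matters for ambiguous inputs such as 03/04/2017, which is one reason a fixed list does not scale to millions of domains.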

Sites we want to crawl
• Unvisited new sites with new content (e.g. new articles)
• Sites with links to sites with new content (e.g. a forum thread list)
• Known sites with fresh content (e.g. hot articles with many comments)

Crawling strategy
• Minimize the time between publication of a post and its discovery by the crawler
• Maximize number of extracted posts
• Minimize website hits

Crawling queue
• The longer the crawler works, the better decisions it makes
• 500 000+ monitored domains
• Enormous website graphs
• Sorted queue in RAM
• Limit on depth
• Unsorted list of URLs in HBase
• MapReduce jobs that create work batches for the crawler
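
The "sorted queue in RAM" with a "limit on depth" could look roughly like the sketch below. It is an illustration only: the `CrawlQueue` class and the `score` value (standing in for whatever the crawler has learned about a URL's freshness) are hypothetical, and the HBase and MapReduce parts are left out.

```python
import heapq
from typing import List, Optional, Set, Tuple


class CrawlQueue:
    """In-memory priority queue of URLs with a depth limit."""

    def __init__(self, max_depth: int) -> None:
        self.max_depth = max_depth
        self._heap: List[Tuple[float, int, str, int]] = []
        self._seen: Set[str] = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url: str, depth: int, score: float) -> bool:
        """Queue a URL; return False if it is too deep or already known."""
        if depth > self.max_depth or url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-score, self._counter, url, depth))
        self._counter += 1
        return True

    def pop(self) -> Optional[Tuple[str, int]]:
        """Return (url, depth) with the highest score, or None if empty."""
        if not self._heap:
            return None
        _, _, url, depth = heapq.heappop(self._heap)
        return url, depth
```

Keeping only the sorted part in RAM and the long tail of unsorted URLs elsewhere is what keeps a structure like this small enough to fit in memory.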

Lingering problems
• Generated traffic
• Proper detection of website size
• Kindness parameter – robots.txt crawl-delay
• Sites with bad HTML
• Dynamic websites
• Monitoring and storing performance statistics
• USER AGENT: SentiBot www.sentibot.eu (compatible with Googlebot)
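
As an illustration of the crawl-delay "kindness parameter", the sketch below reads it with Python's standard `urllib.robotparser`. The sample robots.txt content and the `politeness_delay` and `is_allowed` helpers are made up for the example; how SentiBot actually handles this is not described in the slides.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: SentiBot
Crawl-delay: 5
Disallow: /private/

User-agent: *
Crawl-delay: 1
"""


def _parser_for(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def politeness_delay(robots_txt: str, user_agent: str, default: float = 1.0) -> float:
    """Seconds to wait between hits; fall back to `default` if unspecified."""
    delay = _parser_for(robots_txt).crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """True if robots.txt permits this user agent to fetch the path."""
    return _parser_for(robots_txt).can_fetch(user_agent, path)
```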

Research project
• Crawling optimization is part of a research grant carried out in collaboration with Gdańsk University of Technology
• Funded by the National Centre for Research and Development

Data cluster
Dos and don’ts

ElasticSearch Tech specs
• 1 document type
• 28TB with replication
• 14.3B documents
• 848 shards
• 78 indices (partitioned by date and language)
• 48 nodes
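
Dividing the totals above gives a rough feel for the average shard and node size. The figures come straight from the slide; the assumptions are that TB means 10^12 bytes and that the shard count covers the same replicated data as the 28TB figure.

```python
# Totals taken from the slide.
TOTAL_BYTES = 28 * 10**12      # 28TB with replication (assuming decimal TB)
TOTAL_DOCS = 14_300_000_000    # 14.3B documents
SHARDS = 848
NODES = 48


def average(total: float, units: int) -> float:
    """Plain average; real shards and nodes are never this evenly loaded."""
    return total / units


gb_per_shard = average(TOTAL_BYTES, SHARDS) / 1e9
docs_per_shard_millions = average(TOTAL_DOCS, SHARDS) / 1e6
gb_per_node = average(TOTAL_BYTES, NODES) / 1e9
shards_per_node = average(SHARDS, NODES)
```

Under those assumptions this works out to roughly 33GB and 17M documents per shard, and about 18 shards per node.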

Load
• 2k search requests per second
• 1k index requests per second
• Approximately 75% of document inserts are duplicates
• Up to 45M new documents daily
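
With three out of four inserts being duplicates, writes need to be idempotent. One common way to achieve that (an assumption here, the slides do not say how SentiOne does it) is to derive the document ID deterministically from fields that identify a post, so a repeated insert overwrites instead of creating a copy. The field names and the in-memory `FakeIndex` below are hypothetical stand-ins.

```python
import hashlib
from typing import Dict


def document_id(url: str, author: str, published: str) -> str:
    """Deterministic ID: the same post always hashes to the same value."""
    key = "\x1f".join((url, author, published))  # unit separator avoids collisions
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


class FakeIndex:
    """In-memory stand-in for a search index keyed by document ID."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, str]] = {}
        self.duplicates = 0

    def index(self, doc: Dict[str, str]) -> str:
        doc_id = document_id(doc["url"], doc["author"], doc["published"])
        if doc_id in self.docs:
            self.duplicates += 1  # a repeat: overwrite rather than add
        self.docs[doc_id] = doc
        return doc_id
```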

Configuration of each node
• 2 or 3 SSDs in software RAID 0
• CPU with 4 or 8 physical cores (Intel Xeon D-1520/D-1540)
• 64GB RAM
• 30GB of JVM heap for ElasticSearch, small enough to keep pointer compression enabled
• Half of the RAM is left to the OS filesystem cache (do not give the JVM the whole RAM)
• Scaling performance by adding nodes to cluster

Why ElasticSearch?
• Solr was single-node
• Its cloud version (SolrCloud) was still unstable at that time
• Built on ye good ol’ Lucene
• Free and ready to use
• Good enough documentation
• Automatic rebalancing and redundancy
• Develops quickly, strong community

Main flaws of ElasticSearch
• Very fragile to network issues!
• Split brain
• Upgrade often requires reindexing data
• Schema changes not allowed without reindexing (only adding new fields)
• Think twice before setting mappings/tokenizers/analyzers
• Make sure you monitor and store every cluster property (e.g. with Grafana)
• No out-of-the-box security (changed in the newest 5.x)
• Heavy queries may cause cluster failure
• Problems with memory leaks
• Load balancing is not perfect, some nodes/shards can be more ‘hot’
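
On the split brain point: in ElasticSearch versions before 7.0 the usual guard is the `discovery.zen.minimum_master_nodes` setting, which should be a strict majority of the master-eligible nodes. The helper below is a small sketch of that rule; the slides do not say which value SentiOne used.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Strict majority (quorum) of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1
```

With three dedicated masters the quorum is 2, so a node cut off from the other two cannot elect itself.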

Recommended Plugins
• Kopf – better than head
• Stempel – stemming for Polish
• ICU – transliteration for Greek and Cyrillic alphabets
• Worddelimiter2 – can handle more cases
• Decompound – very useful for German
• Combo – combining multiple analyzers
• Extended-analyze – debugging
• Inquisitor – quick debugging
• Migration – checking what needs to be changed before migration

Ideas for the future
• Upgrade to the newest version
• Move nodes to VLAN (vRack)
• Split master/data/client nodes

Cassandra
• 1:1 relationship with data in ElasticSearch
• Used to store volatile metadata (number of likes, comments, views etc.)
• Deployed on 8 nodes, 4TB of data
• Searching only by key fields
• Optimized for environments with more writes than reads

Things that didn’t work
Shame! Shame! Shame!

Scaling ES cluster in two locations
• Cluster distributed between Poland and France
• Network instability

Not setting proper heap size for ElasticSearch
• Heap-related problems? Increase the heap size? No!
• More memory is not always better
• Keep the heap below 31GB so the JVM can keep using smaller (compressed) pointers
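
The two heap rules from the slides (leave half of the RAM to the OS filesystem cache, and stay under the compressed-pointer limit) can be combined into one small calculation. The `recommended_heap_gb` helper is a hypothetical sketch; the 30GB default mirrors the value the deck reports using on 64GB nodes.

```python
def recommended_heap_gb(ram_gb: int, compressed_pointer_limit_gb: int = 30) -> int:
    """Half of the RAM, capped so the JVM keeps using compressed pointers."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return min(ram_gb // 2, compressed_pointer_limit_gb)
```

For a 64GB node this gives 30GB, and it gives the same 30GB for a 128GB node, which is why adding RAM beyond 64GB does little for a single ElasticSearch process.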

Not monitoring every variable of ElasticSearch
• All status variables should be stored for comparison in case of downtime
• Default plugins show only the current status
• Compare ES data with OS variables (e.g. load, io_wait, network)
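
The point of storing every variable is being able to compare "before" and "after" once something breaks. A minimal in-memory sketch of that idea follows; the `MetricHistory` class and the metric names are hypothetical, and a real setup would write to a time-series store instead.

```python
from typing import Dict, List, Optional, Tuple

Snapshot = Dict[str, float]


class MetricHistory:
    """Keeps timestamped snapshots so past and present can be compared."""

    def __init__(self) -> None:
        self._rows: List[Tuple[float, Snapshot]] = []

    def record(self, timestamp: float, snapshot: Snapshot) -> None:
        if self._rows and timestamp < self._rows[-1][0]:
            raise ValueError("snapshots must be recorded in time order")
        self._rows.append((timestamp, dict(snapshot)))

    def at(self, timestamp: float) -> Snapshot:
        """Latest snapshot taken at or before the given time."""
        chosen: Optional[Snapshot] = None
        for ts, snap in self._rows:
            if ts > timestamp:
                break
            chosen = snap
        if chosen is None:
            raise LookupError("no snapshot recorded at or before that time")
        return chosen

    def changes(self, before: float, after: float) -> Dict[str, float]:
        """Per-metric difference between two points in time."""
        old, new = self.at(before), self.at(after)
        return {name: new[name] - old[name] for name in new if name in old}
```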

Not partitioning ES indices
• The less data you query, the better
• Most calls are for the newest data
• Gather statistics on queries and partition indices accordingly!
• No easy metric to choose shard size/number
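
With indices partitioned by date and language, a query for recent data only needs to touch a few of them. The sketch below picks the indices covering a date range. The monthly granularity and the `posts-<lang>-<yyyy>.<mm>` naming scheme are assumptions made for the example, not the layout described in the slides.

```python
from datetime import date
from typing import List


def indices_for(lang: str, start: date, end: date, prefix: str = "posts") -> List[str]:
    """Names of the monthly indices that cover [start, end] for one language."""
    if start > end:
        raise ValueError("start must not be after end")
    names: List[str] = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{lang}-{year:04d}.{month:02d}")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return names
```

A search for Polish posts from the last few weeks would then be sent to one or two indices instead of all of them.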

Scaling ES cluster vertically
• No point in using servers with more than 64GB RAM
• A smaller number of nodes is easier to manage only in theory
• High IO on single node

Not setting proper compaction strategy in Cassandra
• Changing from Size-Tiered to Leveled Compaction
• Data was never compacted
• Adding new nodes to cluster just to finish compaction
• Cluster downtime due to disk space problems

Abandoned technologies
• H2 Server – replaced with MariaDB, replaced with Percona
• Solr – replaced with ElasticSearch
• ActiveMQ – soon!
• Google Search API – replaced with YaCy
• Java -> Scala
• Maven -> SBT
• Grails -> Play!
• Vaadin -> ReactJS

Overhyped technologies
• Hadoop is cool… but not so long ago it was…
• Poorly documented
• Expensive to run, requires multiple nodes
• Overcomplicated
• Unstable

Thank you
Michał Brzezicki
michal@sentione.com
https://pl.linkedin.com/in/brzezicki
+48 603 926 001
