SentiOne is one of the leading solutions in Europe for social media listening and analysis. We monitor over 26 European markets including CEE, Scandinavia, DACH, and the Balkans. The amount of data that is processed every day and is ready to be queried by our users is enormous. Over the years we have tested many big data technologies and approaches, many of which failed. The presentation covers our experiences and lessons learned from setting up a big data company from scratch. I will give details on configuring a robust ElasticSearch cluster with over 26TB of data and describe key challenges in efficient web crawling and data extraction.
February 9, 2017, Warsaw
Deployed on…
• 201 dedicated servers at OVH
  • OVH (77 servers)
  • SoYouStart (14 servers)
  • Kimsufi (110 servers)
• 25 virtual cloud based machines using Microsoft Azure
Brief architecture description
• Crawling servers gathering data from WWW and various APIs
• Intermediate analysing layer
• Database
• Web application with GUI for analysing data
Gathering data
• WWW
• Millions of domains
• Each HTML site is different
• Thousands of date formats
• Searching for new domains
• Only a small portion of the data is interesting
• Content, date, author, keywords
• Proprietary algorithm for data extraction
• Social Media APIs
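The "thousands of date formats" problem can be illustrated with a minimal sketch that tries a list of candidate formats in order. This is not the proprietary extraction algorithm mentioned above; the format list and the `parse_date` helper are hypothetical and only show the general shape of the fallback approach.

```python
from datetime import datetime
from typing import Optional

# Hypothetical sample of formats; a real crawler would need far more,
# plus per-language month names and relative dates ("2 hours ago").
CANDIDATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%d.%m.%Y %H:%M",
    "%d/%m/%Y",
]


def parse_date(text: str) -> Optional[datetime]:
    """Return the first successful parse, or None if no format matches."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next candidate format
    return None


print(parse_date("09.02.2017 14:30"))
print(parse_date("yesterday"))
```

Ambiguous inputs such as 03/04/2017 are resolved here purely by the order of the list, which is one reason a per-domain choice of format is usually needed in practice.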
Sites we want to crawl
• Unvisited new sites with new content (e.g. new articles)
• Sites with links to sites with new content (e.g. forum thread lists)
• Known sites with fresh content (e.g. hot articles with many comments)
Crawling strategy
• Minimize the time between publication of a post and its discovery by the crawler
• Maximize number of extracted posts
• Minimize website hits
Crawling queue
• The longer the crawler works, the better decisions it makes
• 500 000+ monitored domains
• Enormous website graphs
• Sorted queue in RAM
• Limit on depth
• Unsorted list of URLs in HBase
• MapReduce jobs for creating jobs for the crawler
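A sorted in-memory queue with a depth limit, as listed above, could look roughly like the sketch below. The `CrawlQueue` class and the meaning of `score` are assumptions made for illustration; the slides do not say how URLs are actually ranked.

```python
import heapq
import itertools
from typing import Optional, Tuple


class CrawlQueue:
    """In-memory priority queue of URLs with a limit on link depth."""

    def __init__(self, max_depth: int) -> None:
        self.max_depth = max_depth
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, url: str, score: float, depth: int) -> bool:
        """Queue a URL; reject it if it is too deep or already known."""
        if depth > self.max_depth or url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._heap, (-score, next(self._counter), url, depth))
        return True

    def pop(self) -> Optional[Tuple[str, int]]:
        """Return the highest-scored (url, depth) pair, or None when empty."""
        if not self._heap:
            return None
        _, _, url, depth = heapq.heappop(self._heap)
        return url, depth


queue = CrawlQueue(max_depth=2)
queue.push("http://example.com/forum", score=0.2, depth=0)
queue.push("http://example.com/hot-article", score=0.9, depth=1)
print(queue.pop())
```

The score is hypothetical; it might, for example, grow with how often a page yielded new posts in the past, which would match the observation that the crawler decides better the longer it runs.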
Lingering problems
• Generated traffic
• Proper detection of website size
• Kindness parameter – robots.txt crawl-delay
• Sites with bad HTML
• Dynamic websites
• Monitoring and storing performance statistics
• USER AGENT: SentiBot www.sentibot.eu (compatible with Googlebot)
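As an illustration of the crawl-delay point, the sketch below reads the delay and the fetch permission for a given user agent with Python's standard `urllib.robotparser`. The robots.txt content, the example URLs and the one-second fallback are assumptions, not values from the talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed from a string so no network access is needed.
ROBOTS_TXT = """\
User-agent: SentiBot
Crawl-delay: 5
Disallow: /private/

User-agent: *
Crawl-delay: 10
"""


def crawl_policy(robots_txt, user_agent, url, default_delay=1.0):
    """Return (allowed, delay_in_seconds) for one URL and user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    if delay is None:
        return allowed, default_delay  # assumed fallback, not from the talk
    return allowed, float(delay)


print(crawl_policy(ROBOTS_TXT, "SentiBot", "https://example.com/news/1"))
print(crawl_policy(ROBOTS_TXT, "SentiBot", "https://example.com/private/x"))
```

A crawler would then wait at least the returned number of seconds between two hits on the same host.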
Research project
• Optimizing crawling is part of a research grant carried out in collaboration with Gdańsk University of Technology
• Funded by The National Centre for Research and Development
ElasticSearch Tech specs
• 1 document type
• 28TB with replication
• 14.3B documents
• 848 shards
• 78 indices (partitioned by date and language)
• 48 nodes
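The figures above imply some rough per-shard and per-node averages, worked out in the sketch below. It assumes decimal units (1 TB = 1000 GB) and an even distribution, and the slide does not say whether the document count includes replicas, so the results are only order-of-magnitude estimates.

```python
# Figures quoted on the slide; decimal units are an assumption.
TOTAL_TB = 28
DOCUMENTS = 14.3e9
SHARDS = 848
NODES = 48


def cluster_averages(total_tb, documents, shards, nodes):
    """Average sizes, assuming data is spread evenly across shards and nodes."""
    total_gb = total_tb * 1000
    return {
        "gb_per_shard": total_gb / shards,
        "million_docs_per_shard": documents / shards / 1e6,
        "shards_per_node": shards / nodes,
        "gb_per_node": total_gb / nodes,
    }


averages = cluster_averages(TOTAL_TB, DOCUMENTS, SHARDS, NODES)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions an average shard holds about 33 GB and each node stores a little under 600 GB.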
Load
• 2k search requests per second
• 1k index requests per second
• Approximately 75% of document inserts are duplicates
• Up to 45M new documents daily
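With so many duplicate inserts, one common way to keep them from becoming separate documents is to derive the document ID from the content itself, so that a repeated insert maps to the same ID. The slides do not describe how duplicates are actually handled; the sketch below, including the choice of fields and of SHA-1, is an assumption for illustration only.

```python
import hashlib


def document_id(url, author, content):
    """Deterministic ID: identical posts always hash to the same value."""
    normalized = "\n".join(part.strip() for part in (url, author, content))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(docs, seen_ids):
    """Return (id, doc) pairs not seen before and record their IDs."""
    fresh = []
    for doc in docs:
        doc_id = document_id(doc["url"], doc["author"], doc["content"])
        if doc_id not in seen_ids:
            seen_ids.add(doc_id)
            fresh.append((doc_id, doc))
    return fresh


batch = [
    {"url": "http://example.com/t/1", "author": "anna", "content": "Hello"},
    {"url": "http://example.com/t/1", "author": "anna", "content": "Hello "},
    {"url": "http://example.com/t/1", "author": "piotr", "content": "Hi"},
]
seen = set()
print(len(drop_duplicates(batch, seen)))
```

Using such an ID as the document key also makes a repeated index request overwrite the existing document instead of adding a copy.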
Configuration of each node
• 2 or 3 SSD disks in software RAID 0
• CPU 4 or 8 physical cores (Intel Xeon D 1540/1520)
• 64GB RAM
• 30GB of heap for the JVM running ElasticSearch, to keep pointer compression enabled
• Half of the RAM is used by the OS filesystem cache (do not give the JVM all of the RAM)
• Scaling performance by adding nodes to cluster
Why ElasticSearch?
• Solr was single-node
  • Its cloud version was still unstable at that time
• ye good ol’ Lucene
• Free and ready to use
• Fair enough documentation
• Automatic rebalancing and redundancy
• Develops quickly, strong community
Main flaws of ElasticSearch
• Very fragile to network issues!
• Split brain
• Upgrade often requires reindexing data
• Schema changes not allowed without reindexing (only adding new fields)
• Think twice before setting mappings/tokenizers/analyzers
• Make sure you monitor and store every cluster property (e.g. with Grafana)
• No out of the box security (changed in newest 5.x)
• Heavy queries may cause cluster failure
• Problems with memory leaks
• Load balancing is not perfect, some nodes/shards can be more ‘hot’
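For the split-brain item, the usual mitigation in ElasticSearch versions of that era (Zen discovery, before 7.x) was to set `discovery.zen.minimum_master_nodes` to a majority of the master-eligible nodes. The slides do not state which value was used; the sketch below only shows the majority calculation, and the node counts are examples.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority quorum: more than half of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Example counts only; 48 assumes every node in the cluster is master-eligible.
for nodes in (3, 5, 48):
    print(nodes, "->", minimum_master_nodes(nodes))
```

With a majority quorum, two halves of a partitioned cluster cannot both elect a master, because at most one side can hold more than half of the eligible nodes.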
Recommended Plugins
• Kopf – better than head
• Stempel – stemming for Polish
• ICU – transliteration for Greek and Cyrillic alphabets
• Worddelimiter2 – can handle more cases
• Decompound – very useful in German
• Combo – combining multiple analyzers
• extended-analyze – debugging
• inquisitor – quick debugging
• Migration – checking what needs to be changed before migration
Ideas for the future
• Upgrade to newest version
• Move nodes to VLAN (vRack)
• Split master/data/client nodes
Cassandra
• 1:1 relationship with data in ElasticSearch
• Used to store volatile metadata (number of likes, comments, views etc.)
• Deployed on 8 nodes, 4TB of data
• Searching only by key fields
• Optimized for environments with more writes than reads
Scaling ES cluster in two locations
• Cluster distributed between Poland and France
• Network instability
Not setting proper heap size for ElasticSearch
• Heap related problems? Increase heap size? No!
• More memory is not always better
• Keep below 31GB to have smaller pointers
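The heap rule from the slides (leave about half of the RAM to the OS filesystem cache and stay below the compressed-pointer threshold) can be written as a small helper. The 30GB cap mirrors the value quoted earlier in the deck; the function name is hypothetical.

```python
def recommended_heap_gb(ram_gb: int, pointer_limit_gb: int = 30) -> int:
    """Half of the RAM for the JVM heap, capped below the ~31GB threshold."""
    half_of_ram = ram_gb // 2  # the other half stays with the OS file cache
    return min(half_of_ram, pointer_limit_gb)


for ram in (16, 32, 64, 128):
    print(f"{ram}GB RAM -> {recommended_heap_gb(ram)}GB heap")
```

For a 64GB node this gives 30GB, and the result no longer grows for larger machines.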
Not monitoring every variable of ElasticSearch
• All status variables should be stored for comparison in case of downtime
• Default plugins show only current status
• Compare ES data with OS variables (e.g. load, io_wait, network)
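One way to keep every status variable for later comparison is to flatten each stats snapshot into dotted metric names that a time-series store can ingest. The sketch below assumes the snapshot is already available as a nested dictionary; the sample values and the `flatten_metrics` helper are made up for illustration.

```python
import time


def flatten_metrics(stats, prefix=""):
    """Turn nested stats into {"a.b.c": number}, skipping non-numeric values."""
    flat = {}
    for key, value in stats.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = value
    return flat


# Hypothetical snapshot mixing ElasticSearch and OS values for one node.
snapshot = {
    "name": "node-1",
    "jvm": {"mem": {"heap_used_percent": 71}},
    "os": {"load": 3.2, "io_wait": 0.4},
}
timestamp = int(time.time())
for metric, value in sorted(flatten_metrics(snapshot).items()):
    print(timestamp, metric, value)
```

Storing these rows at a fixed interval makes it possible to compare the state before and after an incident instead of seeing only the current status.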
Not partitioning ES indices
• The less data you query the better
• Most calls are for newest data
• Gather statistics on queries and partition indices accordingly!
• No easy metric to choose shard size/number
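To show why partitioning reduces the data touched by a query, the sketch below picks only the indices that overlap a requested date range and language set. The monthly granularity and the `posts-<lang>-<yyyy>-<mm>` naming scheme are assumptions; the slides only say that indices are partitioned by date and language.

```python
from datetime import date


def months_between(start: date, end: date):
    """Yield (year, month) pairs from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def indices_for_query(languages, start, end, prefix="posts"):
    """Names of the hypothetical monthly indices a query needs to hit."""
    return [
        f"{prefix}-{lang}-{year:04d}-{month:02d}"
        for lang in languages
        for year, month in months_between(start, end)
    ]


names = indices_for_query(["pl", "de"], date(2016, 12, 15), date(2017, 2, 9))
print(",".join(names))
```

ElasticSearch accepts a comma-separated list of index names in a search request, so a query for recent Polish and German posts would touch six indices here rather than the whole cluster.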
Scaling ES cluster vertically
• No point in using servers with more than 64GB RAM
• Smaller number of nodes is easier to manage only in theory
• High IO on single node
Not setting proper compaction strategy in Cassandra
• Changing from Size-Tiered to Leveled Compaction
• Data was never compacted
• Adding new nodes to cluster just to finish compaction
• Cluster downtime due to disk space problems
Abandoned technologies
• H2 Server – replaced with MariaDB, replaced with Percona
• Solr – replaced with ElasticSearch
• ActiveMQ – soon!
• Google Search API – replaced with YaCy
• Java -> Scala
• Maven -> SBT
• Grails -> Play!
• Vaadin -> ReactJS
Overhyped technologies
• Hadoop cool… but not so long ago it was…
• Poorly documented
• Expensive to run, requires multiple nodes
• Overcomplicated
• Unstable
Thank you
Michał Brzezicki
michal@sentione.com
https://pl.linkedin.com/in/brzezicki
+48 603 926 001