IC-SDV 2019: Distributing AI to the Amazon Cloud - Klaus Kater (Deep SEARCH 9, Germany)
1. © 2019 Deep SEARCH 9 GmbH
Deep SEARCH 9
Distributing AI to the Amazon cloud
IC-SDV 2019 08 - 09 April Nice, France
Klaus Kater
Deep SEARCH 9 GmbH
Managing Partner
https://deepsearchnine.com
2.
Sources
Surface Web
Deep Web
Databases
Repositories
Scheduled execution
Unattended retrieval/crawling
Prepare semantic search
Automatic publication
Deep SEARCH 9
Information Consumers
Ontology management
SEARCHCORPORA
• Biotech
• CROs
• Digital Therapeutics
• Technology Transfer Offices
• Clinical trials
• Other scopes of information
• Known (trusted) sources
• More complete
• Faster
Search applications for specific scopes of information
3.
Why move to the cloud?
DS9 needs more and more resources…
2015 2016 2017 2018 2019
• 30.000 company websites
• Link depth 3
• Once every 3 months
• ca. 50 GB of data
• 60.000 company websites
• Link depth 5
• Every month
• ca. 1 TB of data
…because our search engines keep gobbling information like the Cookie Monster gobbles cookies!
• 250.000 company websites
• Link depth 5
• Twice a month
• ?
4.
Therefore we need:
More CPU power
• Content classification
• Semantic tagging
• Machine learning
Faster networks
• High bandwidth requirements
• Network latency problems
Scalability
• Availability and responsiveness for users
• CPU during analysis
• Bandwidth during crawling
The only place to get all of this is the cloud!
5.
More CPU power
6.
More CPU power
Option                                          Price               Hours    Yearly budget
EC2 Dynamic Scaling: r5.4xlarge + 2 TB SSD      1,22 € per hour     8.250    10.065 €
Bare metal hardware                             839,00 € per month  8.760    10.068 €
7.
More CPU power
But we need to be able to do the job in about 2 days
EC2 runtime compared to bare metal server
Configuration       Budget (year)   Concurrent DS9 nodes   Hours / day   Hours / month   Hours / year
EC2 10 instances    10.065 €        10                     2             69              825
EC2 20 instances    10.065 €        20                     1             34              413
EC2 50 instances    10.065 €        50                     -             14              165
EC2 100 instances   10.065 €        100                    -             7               83
20x as much CPU for the same price!
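The runtime figures on this slide follow from spreading the instance-hours that the yearly budget buys across the concurrent instances. The following is a minimal Python sketch of that arithmetic, using the prices and hours quoted on the slide. The function and constant names are illustrative assumptions and are not part of DS9.

```python
# Figures quoted on the slide; all names here are illustrative assumptions.
EC2_PRICE_PER_HOUR_EUR = 1.22      # EC2 r5.4xlarge + 2 TB SSD
EC2_HOURS_BUDGETED = 8250          # instance-hours bought per year
BARE_METAL_PER_MONTH_EUR = 839.00  # one bare metal server


def yearly_budgets():
    """Return (EC2 budget, bare metal budget) in euros per year."""
    ec2 = round(EC2_PRICE_PER_HOUR_EUR * EC2_HOURS_BUDGETED, 2)
    bare_metal = round(BARE_METAL_PER_MONTH_EUR * 12, 2)
    return ec2, bare_metal


def runtime_per_instance(instances):
    """Split the budgeted instance-hours evenly across concurrent instances."""
    per_year = EC2_HOURS_BUDGETED / instances
    return {"year": per_year, "month": per_year / 12, "day": per_year / 365}


if __name__ == "__main__":
    print("Yearly budgets (EC2, bare metal):", yearly_budgets())
    for n in (10, 20, 50, 100):
        hours = runtime_per_instance(n)
        print(n, "instances:", round(hours["year"], 1), "h/year,",
              round(hours["month"], 1), "h/month")
```

The same budget that keeps one bare metal server running all year buys fewer hours per instance as the cluster grows, which is acceptable when a crawl only has to run for a short burst.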
8.
Next bullet point: Faster networks
Viewers show the global distribution of companies in our SEARCHCORPORA.
Obviously there is a lot of activity in Japan (JPN), India (IND), China (CHN), Korea (KOR), Hong Kong (HKG), Iran (IRN), Pakistan (PAK), Taiwan (TWN), Malaysia (MYS), Bangladesh (BGD), Singapore (SGP), …
9.
Faster networks
Note that Tokyo and Seattle are the same distance from our servers (9.300 km), as are Boston and New Delhi (6.000 km), but network latency is much higher going east or south
Ping time from DS9 server
Circles are simply squeezed to compensate for Mercator distortion
10.
Faster networks
But can we make the network connection faster?
Simple calculation
Typical page: 30 kB
Typical webserver: 500 pages
From Tokyo
Transferring 1 page from Tokyo: 1.200 ms
500 pages: 500 x 1.200 ms = 10 minutes
1.000 servers: 6 days 23 hours
From London
Transferring 1 page from London: 82 ms
500 pages: 500 x 82 ms = 41 seconds
1.000 servers: ca. 11,4 hours
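The calculation on this slide can be reproduced in a few lines of Python. This is a minimal sketch that assumes pages are fetched strictly one after another, with no parallel connections; the function name and defaults are illustrative, and the per-page times are the ones quoted on the slide.

```python
from datetime import timedelta

PAGES_PER_SERVER = 500  # "typical webserver" on the slide
SERVERS = 1000


def crawl_time(ms_per_page, pages=PAGES_PER_SERVER, servers=SERVERS):
    """Total transfer time, assuming strictly sequential page fetches."""
    return timedelta(milliseconds=ms_per_page * pages * servers)


if __name__ == "__main__":
    for origin, ms in (("Tokyo", 1200), ("London", 82)):
        per_server = crawl_time(ms, servers=1)
        total = crawl_time(ms)
        hours = total.total_seconds() / 3600
        print(f"{origin}: {per_server} per server, "
              f"{total} ({hours:.1f} h) for {SERVERS} servers")
```

With these inputs the Tokyo crawl takes 6 days 22 hours 40 minutes, and the London crawl takes 41.000 seconds, or roughly 11.4 hours.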
11.
No. But we could distribute DS9!
We can distribute DS9 instances across the world using the Amazon cloud.
This map shows the Amazon EC2 data center locations.
13.
Challenges
Use standards or develop proprietary?
Hadoop is what one thinks of when hearing "distributed analytics"…
MapReduce algorithms are good at distributing cut-down analytics tasks across multiple CPUs, which is what we would use at the filter step level. They are not suited to distributing whole filter chains with arbitrary analytics tasks such as text annotation with ontologies or Deep Web crawling with URLs constructed in real time.
How can we minimize I/O operations?
I/O operations (especially indexing of data) and data transfer are the bottlenecks and could potentially eat up all the benefits of distribution.
1. Data must be read only once from the DS9 backend (no copying)
2. Data must be transferred in compressed chunks (to overcome latency issues)
3. Data must be indexed only once at the final destination on the DS9 backend
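Requirement 2 can be illustrated with a small Python sketch: records are grouped into chunks and each chunk is compressed before it crosses the network, so a high-latency link carries a few large transfers instead of many small ones. The slides do not say which serialization or compression DS9 uses; JSON and gzip, the function names and the chunk size are all assumptions made for illustration.

```python
import gzip
import json


def pack_chunks(records, chunk_size=1000):
    """Yield gzip-compressed JSON blobs of at most chunk_size records each.

    JSON and gzip are illustrative choices, not DS9's actual wire format.
    """
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        yield gzip.compress(json.dumps(chunk).encode("utf-8"))


def unpack_chunk(blob):
    """Inverse of pack_chunks for a single compressed blob."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    pages = [{"url": f"https://example.com/{i}", "text": "lorem ipsum " * 20}
             for i in range(2500)]
    blobs = list(pack_chunks(pages))
    raw_size = len(json.dumps(pages).encode("utf-8"))
    packed_size = sum(len(b) for b in blobs)
    print(len(blobs), "chunks,", raw_size, "bytes raw,", packed_size, "bytes packed")
```

Larger chunks mean fewer round trips, at the cost of more memory on both ends and more work to redo if a transfer fails.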
14.
Distributing DS9 instances
DS9 standard node
ds9App
Frontdoor
• Webserver
Firewall
Browserfarm
• DS9
• MySQL
• Elasticsearch
• Blazegraph
• DS9 Farming
• MySQL
• DS9 App
• MySQL
DS9 distributed node
Frontdoor
• Webserver
Firewall
• DS9
• MySQL
• Blazegraph
Smaller footprint!
15.
DS9 EC2 cloud clusters
Two new types of DS9 jobs were implemented:
That's what we always did
• Execute a job from the main DS9 host remotely on another DS9 host for load distribution
• Execute a job from the main DS9 host on a dynamically allocated cluster of EC2 instances that have DS9 Solutions installed
• Controlled by DS9 Farming
• URLs read from DS9 main host
• Results written back to DS9 main host
• Start 20 nodes in DS9 cluster mode
• Use t3.xlarge node type (4 vCPUs, 96 GB)
• Run all instances at Amazon in Tokyo
16.
Dynamic cloud allocation
DS9 standard installation
ds9App
Frontdoor
• Webserver
Firewall
Browserfarm
DS9 / IDE
• DS9
• MySQL
• Elasticsearch
• Blazegraph
• DS9 Farming
• MySQL
• DS9 App
• MySQL
Accounting
AWS Region Tokyo
20x: deployment takes < 5 minutes
• DS9
• MySQL
• Blazegraph
Each node is a full installation of DS9 Solutions (without Elasticsearch)
Instances are dynamically allocated, deployed and started, jobs are executed, and at the end all instances are terminated
Finally fully scalable (this satisfies our 3rd need)
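Because the instances are billed while they run, the "terminate all instances at the end" step should happen even when a job fails. One common way to express that allocate, run, terminate lifecycle in Python is a context manager. This is a sketch under stated assumptions: `start_nodes` and `stop_nodes` are hypothetical callables standing in for whatever cloud API DS9 Farming uses, which the slides do not describe.

```python
from contextlib import contextmanager


@contextmanager
def temporary_cluster(start_nodes, stop_nodes, count):
    """Start `count` nodes, hand them to the caller, and always stop them.

    start_nodes and stop_nodes are hypothetical callables wrapping a cloud API.
    """
    nodes = start_nodes(count)
    try:
        yield nodes
    finally:
        stop_nodes(nodes)  # runs even if the job inside the block raises


if __name__ == "__main__":
    running = set()  # in-memory stand-in for real instances

    def start_nodes(count):
        nodes = [f"node-{i}" for i in range(count)]
        running.update(nodes)
        return nodes

    def stop_nodes(nodes):
        running.difference_update(nodes)

    with temporary_cluster(start_nodes, stop_nodes, 20) as nodes:
        print(len(nodes), "nodes running")
    print(len(running), "nodes still running after the job")
```

The `finally` clause is the design point: cleanup is tied to leaving the block, not to the job reporting success.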
17.
Execute Distributed Job
Only move necessary resources to EC2
DS9 Farming
1. export
2. unpack
3. start nodes
4. import job
5. execute job
6. stop nodes
Claim
containers
input
remote read
equally distribute URLs among EC2 nodes
write
cache
remote write
DS9 Solutions (one per EC2 node) …
• DS9
• MySQL
• Blazegraph
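The slide only states that URLs are distributed equally among the EC2 nodes. One simple way to do that is a round-robin split, sketched below in Python. The function name is illustrative, and round-robin is an assumption rather than DS9's documented strategy.

```python
def distribute_urls(urls, node_count):
    """Round-robin split: partition sizes differ by at most one URL."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Node i receives URLs i, i + node_count, i + 2 * node_count, ...
    return [urls[i::node_count] for i in range(node_count)]


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(103)]
    partitions = distribute_urls(urls, 20)
    print([len(p) for p in partitions])
```

A split by URL count ignores how large or slow each site is, so a real scheduler might instead balance by expected page count per host.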
18.
Sources
Information Scientists
SEARCHCORPORA
• Start-ups
• Competitors
• Regulatory
• New technology
• …
Scheduled execution
Unattended updates
Automatic publication
• Known (trusted) sources
• More complete
• Faster
Managed Intelligence 2018
• Information source selection
• Content structuring
• Linking of disparate sources
• Ontology management
• SEARCHCORPUS® management
Search Competence Center
Information Consumers
Internal Customers
Expertise of information scientist needed
Unattended automatic execution of jobs
Sources
Surface Web
Deep Web
Databases
Repositories
Prepare semantic search
Ontology management
19.
Managed Intelligence 2019
Fully distributed
Company repositories, e.g. Crunchbase
Master SEARCHCORPUS®
• Hundreds of thousands of websites
• Many millions of web pages
• PDF-based publications
• Structured data
• Other sources
Extraction using Lucene query + classification
SEARCHCORPORA
• Biotech
• CROs
• Digital Therapeutics
• Technology Transfer Offices
• Clinical trials
• Other scopes of information
Customize for research target
Automatic publication:
• target-specific focus
Information Consumers
Internal Customers
Quality assurance
Qualification
SEARCHCORPUS®
Crawling and automatic classification for classes of interest
Classified targets
Expertise of information scientist needed
Crawling
Unattended automatic execution of jobs
Distributed automatic execution of jobs
Information Scientists
Search Competence Center
Surface / Deep Web