7. The Case Study
Find suspicious government contracts
using heuristics
IT contract where price > 1M euro
Supplier company age < 3 months
using crowdsourcing
Data
Central government contract repositories
www.crz.gov.sk, zmluvy.egov.sk
~70K contracts in 8 months
100+ GB pdf/doc/scan
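The two example heuristics on this slide can be read as simple predicates over a contract. A minimal sketch, assuming a hypothetical `Contract` record whose field names (`category`, `price_eur`, `supplier_founded_on`, `signed_on`) are invented for illustration and are not taken from the real data:

```ruby
require "date"

# Hypothetical contract record; the field names are assumptions.
Contract = Struct.new(:category, :price_eur, :supplier_founded_on, :signed_on,
                      keyword_init: true)

# Each heuristic is a name plus a predicate over a contract.
HEURISTICS = {
  "expensive IT contract" =>
    ->(c) { c.category == "IT" && c.price_eur > 1_000_000 },
  "supplier younger than 3 months" =>
    # Date#<< shifts a date back by the given number of months.
    ->(c) { c.supplier_founded_on > (c.signed_on << 3) }
}.freeze

# Returns the names of all heuristics the contract triggers.
def matching_heuristics(contract)
  HEURISTICS.select { |_name, rule| rule.call(contract) }.keys
end

contract = Contract.new(
  category: "IT",
  price_eur: 2_500_000,
  supplier_founded_on: Date.new(2013, 1, 15),
  signed_on: Date.new(2013, 2, 20)
)

puts matching_heuristics(contract).inspect
```

In the deck the heuristics are stored as searches rather than Ruby lambdas; the lambdas here only make the matching rule concrete.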
10. The Solution
Faceted search
Search
e.g. Find all contracts by Orange Slovakia
Analyze
e.g. Which department has the most contracts with Orange Slovakia?
e.g. What is the contract price distribution for Orange Slovakia?
…
Define penalty heuristics
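The two "Analyze" questions map naturally onto a search request with a per-department count and a price histogram. The sketch below only builds the request body as a Ruby hash. It assumes the field names `supplier`, `department` and `price` (not given in the slides), assumes `department` is indexed as an exact-value field, and uses the Elasticsearch aggregations DSL, the successor of the facets API that "faceted search" referred to at the time:

```ruby
require "json"

# Builds a search body answering both "Analyze" questions for one supplier.
# Field names and the default bucket width are assumptions for illustration.
def supplier_analysis_request(supplier, price_interval: 100_000)
  {
    query: { match_phrase: { supplier: supplier } },
    size: 0, # only the aggregated counts are needed, not the documents
    aggs: {
      # Which department has the most contracts with this supplier?
      by_department: { terms: { field: "department" } },
      # What is the contract price distribution for this supplier?
      price_distribution: {
        histogram: { field: "price", interval: price_interval }
      }
    }
  }
end

puts JSON.pretty_generate(supplier_analysis_request("Orange Slovakia"))
```

The generated JSON would be sent to the search endpoint of the contracts index; how that is done depends on the client library in use.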
14. Percolate
Problem
A new contract/document is added: which heuristics does it match?
Solution
1. Save heuristics/searches in percolator index
2. Percolate new documents
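Percolation reverses the usual search direction: the queries are stored, and each incoming document is matched against them. The following is an in-memory illustration of that idea only; it does not call Elasticsearch, and the class name `HeuristicRegistry` and the document fields are hypothetical:

```ruby
# In-memory stand-in for a percolator index: stored searches keyed by name.
class HeuristicRegistry
  def initialize
    @rules = {}
  end

  # Step 1: save a heuristic (a stored search) under a name.
  def register(name, &predicate)
    @rules[name] = predicate
  end

  # Step 2: percolate a new document, returning the names of all
  # stored heuristics that match it.
  def percolate(document)
    @rules.select { |_name, predicate| predicate.call(document) }.keys
  end
end

registry = HeuristicRegistry.new
registry.register("expensive IT contract") do |doc|
  doc[:category] == "IT" && doc[:price_eur].to_i > 1_000_000
end
registry.register("supplier younger than 3 months") do |doc|
  doc[:supplier_age_months].to_i < 3
end

new_contract = { category: "IT", price_eur: 2_500_000, supplier_age_months: 24 }
puts registry.percolate(new_contract).inspect
```

With a real percolator the predicates would be stored search queries, and step 2 would be a single request carrying the new document.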
16. Scroll
Problem
A new heuristic is added and matches many (1K+) documents
The heuristic has to be added to all matching documents
+ Paging with a growing offset gets slower on deep pages (the offset problem known from RDBMS)
Solution
Use async background job
Scroll through results (a.k.a. cursor)
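To make the offset problem concrete, here is a toy cost model rather than a measurement: it assumes that an offset-based page forces the engine to walk past every skipped row again, while a cursor resumes where the previous page ended. `ToyResults` and its row counter are invented for this illustration.

```ruby
# Toy result set that counts how many rows the "engine" has to touch.
class ToyResults
  attr_reader :rows_touched

  def initialize(rows)
    @rows = rows
    @rows_touched = 0
  end

  # Offset paging: every request walks past the skipped rows again.
  def page_by_offset(offset, size)
    @rows_touched += [offset + size, @rows.size].min
    @rows[offset, size] || []
  end

  # Cursor paging: resume right after the last position already seen.
  def page_by_cursor(cursor, size)
    page = @rows[cursor, size] || []
    @rows_touched += page.size
    [page, cursor + page.size]
  end
end

rows = (1..1000).to_a

by_offset = ToyResults.new(rows)
(0...rows.size).step(100) { |offset| by_offset.page_by_offset(offset, 100) }

by_cursor = ToyResults.new(rows)
cursor = 0
loop do
  page, cursor = by_cursor.page_by_cursor(cursor, 100)
  break if page.empty?
end

puts "offset paging touched #{by_offset.rows_touched} rows" # 5500 in this model
puts "cursor paging touched #{by_cursor.rows_touched} rows" # 1000 in this model
```

Under this model the offset variant grows quadratically with the number of pages, which is why the background job scrolls instead.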
18. Ruby Scroll API
Mimics find_each in ActiveRecord
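# Yields every document matching the query, one scroll page at a time.
# Relies on the initiate_scroll and scroll helpers, which are defined elsewhere.
def find_each(query)
  scroll_id = nil
  processed = 0
  loop do
    first_request = scroll_id.nil?
    # The first request opens the scroll; later requests continue it.
    result = first_request ? initiate_scroll(query) : scroll(scroll_id)
    # A response may carry a fresh scroll id, so keep the latest one.
    scroll_id = result.scroll_id || scroll_id
    result.hits.each { |document| yield document }
    processed += result.hits.size
    # Stop once everything was yielded, or if a follow-up page comes back
    # empty (guards against looping forever when the total is inaccurate).
    break if processed >= result.hits.total
    break if !first_request && result.hits.size.zero?
  end
end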