1. © StreamSets, Inc. All rights reserved.
Project Ouroboros
Using StreamSets Data Collector to Help Manage
the StreamSets Open Source Community
Pat Patterson / Director of Evangelism
@metadaddy / pat@streamsets.com
2.
Who Am I?
Pat Patterson / pat@streamsets.com / @metadaddy
Past: Sun Microsystems, Salesforce
Present: Director of Evangelism, StreamSets
I run far 🏃♂️
3.
Who is StreamSets?
Seasoned leadership team
Customer base from Global 8000: 50%
Unique commercial downloaders: 2000+
Open source downloads worldwide: 3,000,000+
Broad connectivity: 50+
History of innovation
streamsets.com/about-us
6.
Objective
Parse Fastly CDN logs
Extract records relating to downloads
Gain insights
  Companies downloading the binaries
  Geographic reach
  Metrics for different binary artifacts
7.
Before
Bash script to download S3 objects using AWS CLI tool
sed, grep, sort, uniq, awk, diff, xargs, curl
Complex, hard to maintain, slow, essentially ‘write-only’ code
cut -f 1 -d ' ' merge.log|sort|uniq > ips
diff --new-line-format="" --unchanged-line-format="" ips allips > newips
cat newips|xargs -L 1 -I% curl -s http://ipinfo.io/%/org|cut -f 2- -d ' '|sort|uniq>orgs && subl orgs
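For comparison, here is a minimal Python sketch of what the first two shell commands appear to do: collect the unique client IPs from the merged log, then keep only those not already seen. It assumes, as the `cut -f 1 -d ' '` call suggests, that the IP is the first space-delimited field of each line in merge.log. The function names and the sample lines are hypothetical.

```python
def client_ips(log_lines):
    """Unique first fields of each line, like `cut -f 1 -d ' ' | sort | uniq`."""
    return {line.split(" ", 1)[0] for line in log_lines if line.strip()}


def new_ips(log_lines, known_ips):
    """IPs present in the log but absent from the known list (the `diff` step)."""
    return sorted(client_ips(log_lines) - set(known_ips))


# Hypothetical sample input; only the first field matters here.
merged_log = [
    "104.155.191.102 GET /datacollector/latest/parcel/manifest.json 200 1295",
    "104.155.191.102 GET /datacollector/latest/parcel/manifest.json 200 1295",
    "203.0.113.7 GET /datacollector/latest/parcel/manifest.json 200 1295",
]
already_known = ["104.155.191.102"]

unseen = new_ips(merged_log, already_known)
print(unseen)
```

Only the addresses in `unseen` would then need an organization lookup, which is what the third shell command does with curl.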
8.
Why???
Mission creep
Inertia
Image Nyah S / Pexels / Pexels License
9.
Data Flow
Amazon S3 → StreamSets Data Collector → MySQL
10.
Let’s Get Started!
Parse Fastly CDN log lines, send data to MySQL
<134>2017-07-09T12:01:13Z cache-sjc3636 StreamSetsS3Bucket[60550]: 104.155.191.102 "-" "-" Sun, 09 Jul 2017 12:01:12 GMT GET /datacollector/latest/parcel/manifest.json 200 1295
11.
Simple, Right?
Grok patterns are designed for exactly this!
Standard patterns for timestamps, HTTP verbs, filenames
<%{NUMBER:priority}>%{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:cachenode} %{WORD:logname}\[%{NUMBER:pid}\]: %{IP:ip} "-" "-" %{DATESTAMP_FASTLY:datestamp} %{WORD:verb} %{PATH:file} %{NUMBER:code} %{SIZE_OR_NULL}
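To make the field-by-field mapping concrete, here is a rough Python `re` approximation of the Grok pattern, applied to the sample Fastly line from the earlier slide. It is a sketch, not the pattern Data Collector runs: each named group is a simplified stand-in for the corresponding Grok pattern, and the `(null)` alternative for the size field is a guess at what the custom SIZE_OR_NULL pattern accepts.

```python
import re

# Simplified stand-ins for the Grok patterns (assumptions, not Grok's own regexes).
FASTLY_LINE = re.compile(
    r"<(?P<priority>\d+)>"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) "
    r"(?P<cachenode>\S+) "
    r"(?P<logname>\w+)\[(?P<pid>\d+)\]: "
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "-" "-" '
    r"(?P<datestamp>\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2} GMT) "
    r"(?P<verb>[A-Z]+) "
    r"(?P<file>/\S*) "               # strict: must start with a slash, like PATH
    r"(?P<code>\d{3}) "
    r"(?P<size>\d+|\(null\))"        # guess at SIZE_OR_NULL
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not conform."""
    match = FASTLY_LINE.fullmatch(line)
    return match.groupdict() if match else None


SAMPLE = (
    '<134>2017-07-09T12:01:13Z cache-sjc3636 StreamSetsS3Bucket[60550]: '
    '104.155.191.102 "-" "-" Sun, 09 Jul 2017 12:01:12 GMT GET '
    "/datacollector/latest/parcel/manifest.json 200 1295"
)

record = parse_line(SAMPLE)
print(record["ip"], record["verb"], record["file"], record["code"])
```

Note the escaped square brackets around the process ID: unescaped, a regex engine would read them as a character class.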
13.
But...
Record1-Error SERVICE_ERROR_001 - Cannot parse record from message 'rawData': com.streamsets.pipeline.api.service.dataformats.DataParserException: LOG_PARSER_03 - Log line '<134>2017-07-09T12:01:13Z cache-sjc3636 StreamSetsS3Bucket[60550]: 104.155.191.102 "-" "-" Sun, 09 Jul 2017 12:01:12 GMT GET https://archives.streamsets.com/datacollector/latest/parcel/STREAMSETS_DATACOLLECTOR-1.1.4-el6.parcel 404 0' does not conform to 'Grok Format
What??? An HTTP request isn’t supposed to include the protocol like that!
Fastly records whatever the client sends, no matter how dumb.
14.
Solution: Be Permissive with your Input
<%{NUMBER:priority}>%{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:cachenode} %{WORD:logname}\[%{NUMBER:pid}\]: %{IP:ip} "-" "-" %{DATESTAMP_FASTLY:datestamp} %{WORD:verb} %{NOTSPACE:file} %{NUMBER:code} %{SIZE_OR_NULL}
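The difference between the strict and permissive versions can be sketched in Python on just the request portion of the line. The two regexes below are simplified stand-ins for PATH (must start with a slash) and NOTSPACE (any run of non-whitespace); they are assumptions for illustration, not Grok's actual definitions. The optional normalization with `urlsplit` is an extra step not shown on the slide.

```python
import re
from urllib.parse import urlsplit

# Strict: the file field must look like a path. Permissive: any non-space run.
REQUEST_STRICT = re.compile(
    r"(?P<verb>[A-Z]+) (?P<file>/\S*) (?P<code>\d{3}) (?P<size>\d+)"
)
REQUEST_PERMISSIVE = re.compile(
    r"(?P<verb>[A-Z]+) (?P<file>\S+) (?P<code>\d{3}) (?P<size>\d+)"
)


def parse_request(text, pattern):
    """Parse 'VERB file code size'; reduce an absolute URL to its path."""
    match = pattern.fullmatch(text)
    if match is None:
        return None
    fields = match.groupdict()
    # urlsplit leaves a plain path unchanged and strips scheme and host otherwise.
    fields["file"] = urlsplit(fields["file"]).path
    return fields


normal = "GET /datacollector/latest/parcel/manifest.json 200 1295"
odd = (
    "GET https://archives.streamsets.com/datacollector/latest/parcel/"
    "STREAMSETS_DATACOLLECTOR-1.1.4-el6.parcel 404 0"
)

print(parse_request(odd, REQUEST_STRICT))      # rejected by the strict pattern
print(parse_request(odd, REQUEST_PERMISSIVE))  # accepted, file reduced to its path
```

Accepting the field loosely and cleaning it up afterwards keeps odd client requests in the data set instead of in the error stream.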
15.
First Lesson Learned
Even if you think you know the data schema - test with real data!
18.
Solution: Duplicate the Data
CREATE TABLE download (
id int(11) AUTO_INCREMENT,
ip varchar(64),
date datetime,
file varchar(767),
PRIMARY KEY (`id`),
KEY `date_idx` (`date`),
KEY `file_idx` (`file`)
);
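As an illustration of how a parsed log record might land in a table shaped like this one, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The schema is translated to SQLite syntax, and the helper names are hypothetical. The Fastly datestamp is converted to a `YYYY-MM-DD HH:MM:SS` string before insertion.

```python
import sqlite3
from datetime import datetime

# SQLite translation of the slide's schema (assumption: MySQL's
# `int(11) AUTO_INCREMENT` becomes `INTEGER PRIMARY KEY AUTOINCREMENT`).
SCHEMA = """
CREATE TABLE download (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  ip VARCHAR(64),
  date DATETIME,
  file VARCHAR(767)
);
CREATE INDEX date_idx ON download (date);
CREATE INDEX file_idx ON download (file);
"""


def to_sql_datetime(datestamp):
    """Convert 'Sun, 09 Jul 2017 12:01:12 GMT' to '2017-07-09 12:01:12'."""
    parsed = datetime.strptime(datestamp, "%a, %d %b %Y %H:%M:%S GMT")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


def insert_download(conn, ip, datestamp, file):
    """Insert one download row; the `with` block commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO download (ip, date, file) VALUES (?, ?, ?)",
            (ip, to_sql_datetime(datestamp), file),
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
insert_download(
    conn,
    "104.155.191.102",
    "Sun, 09 Jul 2017 12:01:12 GMT",
    "/datacollector/latest/parcel/manifest.json",
)
rows = conn.execute("SELECT ip, date, file FROM download").fetchall()
print(rows)
```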
23.
What’s Next?
Look up company details from the IP address via the KickFire API
25.
But...
com.streamsets.pipeline.api.base.OnRecordErrorException: HTTP_01 - Error fetching resource. Status: 429 Reason: You have reached the maximum calls per second org.glassfish.jersey.message.internal.EntityInputStream@4cb3922b
The KickFire API is rate limited!
To deliver optimum performance to all of our API customers, KickFire balances transaction loads by using rate limits
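One common way to stay under a per-second limit is to space requests out on the client side. The sketch below is a generic minimum-interval limiter, not a Data Collector feature and not anything specific to KickFire. The class name is hypothetical and the calls-per-second figure is a placeholder, since the actual limit is not stated here. The clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time


class MinIntervalLimiter:
    """Block so that successive calls are at least 1/calls_per_second apart."""

    def __init__(self, calls_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / calls_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


# Demonstration with a fake clock so nothing actually sleeps.
fake_now = [0.0]
slept = []


def fake_clock():
    return fake_now[0]


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


limiter = MinIntervalLimiter(2, clock=fake_clock, sleep=fake_sleep)  # placeholder rate
for _ in range(3):
    limiter.wait()  # a real pipeline would issue the API request after this returns
print(slept)
```

In real use, `MinIntervalLimiter(n)` with the default clock would be called once before each lookup request.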
27.
But...
com.streamsets.pipeline.api.base.OnRecordErrorException: HTTP_01 - Error fetching resource. Status: 429 Reason: You have reached the maximum calls per month org.glassfish.jersey.message.internal.EntityInputStream@4cb3922b
The KickFire API has a monthly call limit!
29.
Third Lesson Learned
Know your API’s non-functional constraints!
31.
Leave to run for a few weeks...
Image © Itzuvit / Wikimedia Commons / CC-BY-SA-3.0
32.
But...
com.streamsets.pipeline.api.base.OnRecordErrorException: HTTP_01 - Error fetching resource. Status: 429 Reason: You have reached the maximum calls per month org.glassfish.jersey.message.internal.EntityInputStream@4cb3922b
KickFire’s monthly call limit strikes again!
33.
Root Cause
Seeing large numbers of downloads from the same few IP addresses
Data Collector has a microbatch architecture - database writes are committed at the end of the batch
A new IP address isn’t visible in the database until the start of the next batch
Still making repeated requests to KickFire for the same IP address!
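The effect described above can be reproduced in a few lines of Python. This is a toy model under stated assumptions, not Data Collector code: the "database" is a dict that only receives writes when the batch ends, and `lookup` is a hypothetical stand-in for the API call. Remembering results within the batch is one possible remedy, shown here via the `use_batch_cache` flag.

```python
def process_batch(ips, database, lookup, use_batch_cache):
    """Enrich one batch of IPs; return how many API calls were made.

    `database` only sees new rows after the batch ends, mimicking a
    commit at the end of each microbatch.
    """
    pending = {}  # writes staged until the batch commits
    calls = 0
    for ip in ips:
        if ip in database:
            continue  # already committed by an earlier batch
        if use_batch_cache and ip in pending:
            continue  # already looked up earlier in this same batch
        pending[ip] = lookup(ip)
        calls += 1
    database.update(pending)  # commit at end of batch
    return calls


def fake_lookup(ip):
    """Hypothetical stand-in for the company lookup API."""
    return "Example Org"


batch = ["198.51.100.7", "198.51.100.7", "198.51.100.7", "203.0.113.9"]

calls_without_cache = process_batch(batch, {}, fake_lookup, use_batch_cache=False)
db = {}
calls_with_cache = process_batch(batch, db, fake_lookup, use_batch_cache=True)
calls_next_batch = process_batch(batch, db, fake_lookup, use_batch_cache=True)
print(calls_without_cache, calls_with_cache, calls_next_batch)
```

Without the in-batch cache, every repeat of a new IP within one batch costs another API call, which is how a few busy addresses can burn through a monthly quota.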
35.
Fourth Lesson Learned
Data Collector operates batch-by-batch: design your pipelines accordingly!
38.
Ultimate Lesson Learned
No plan survives first contact with the enemy
Helmuth von Moltke the Elder, "On Strategy" (1871)
Image in the public domain
40.
Ultimate Lesson Learned
Everybody has a plan until they get punched in the mouth
Mike Tyson (1987)
Image © Abelito Roldan / Flickr / CC BY 2.0
41.
September 3-5, 2019
Tue, Sep 3 - Training & Tutorials
Wed-Thu, Sep 4-5 - Keynote & Breakouts
Hilton Financial District
(Tue|Wed|Thur)
43.
Thank you
Pat Patterson / Director of Evangelism
@metadaddy / pat@streamsets.com