Cloudstone - Sharpening Your Weapons Through Big Data

These slides were part of a presentation given at HushCon East 2017. The talk covered how we can use big data to improve the effectiveness of offensive security tools.


  1. 1. Cloudstone - Sharpening Your Weapons Through Big Data - Christopher Grayson @_lavalamp
  2. 2. Introduction
  3. 3. WHOAMI 3 • ATL • Web development • Academic researcher • Haxin’ all the things • (but I rlllly like networks) • Founder • Red team @_lavalamp
  4. 4. • Common Crawl • MapReduce • Hadoop • Amazon Elastic MapReduce (EMR) • Mining Common Crawl using Hadoop on EMR • Other "big" data sources WHAT’S DIS 4
  5. 5. • Academic research =/= industry research • Tactics can (and should!) be cross-applied • Lots of power in big data, only problem is how to extract it • Largely untapped resource • Content discovery (largely) sucks WHY’S DIS 5
  6. 6. 1. Background 2. Common Crawl 3. MapReduce & Hadoop 4. Elastic Map Reduce 5. Mining Common Crawl 6. Data Mining Results 7. Big(ish) Data Sources 8. Conclusion Agenda 6
  7. 7. Background
  8. 8. • DARPA CINDER program • Continual authentication through side channel data mining • Penetration testing • Web Sight My Background 8
  9. 9. • Penetration testing scopes are rarely adequate • Faster, more accurate tools == better engagements • It’s 2017 – application layer often comprises the majority of attack surface • Expedite discovery of application-layer attack surface Time == $$$ 9
  10. 10. • Many web applications map disk contents to URLs • Un-linked resources are commonly less secure • Older versions • Debugging tools • Backups with wrong extensions • Find via brute force • Current tools are quite lacking Web App Content Discovery 10
  11. 11. Common Crawl
  12. 12. • California-based 501(c)(3) non-profit organization • Performing full web crawls on a regular basis using different user agents since 2008 • Data stored in AWS S3 • A single crawl contains many terabytes of data • Full crawl metadata can exceed 10TB What is Common Crawl? 12 http://commoncrawl.org/
  13. 13. • Crawl data is stored in three specialized data formats • WARC (Web ARChive) – raw crawl data • WAT – HTTP request and response metadata • WET – plain-text HTTP responses • WAT files likely contain the juicy bits you’re interested in • Use existing libraries for parsing file contents CC Data Format 13
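For orientation, here is a rough, library-free sketch in Java of streaming a locally downloaded, gzipped WAT file and printing each record's target URI. The file name is hypothetical, the record parsing is deliberately simplified (it ignores Content-Length framing), and a real job should use a proper WARC/WAT parsing library as the slide suggests:

      import java.io.BufferedReader;
      import java.io.FileInputStream;
      import java.io.InputStreamReader;
      import java.nio.charset.StandardCharsets;
      import java.util.zip.GZIPInputStream;

      // Simplified peek at a *.warc.wat.gz file: print the WARC-Target-URI of each record.
      public class WatPeek {
          public static void main(String[] args) throws Exception {
              String path = args.length > 0 ? args[0] : "sample.warc.wat.gz"; // hypothetical file name
              try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                      new GZIPInputStream(new FileInputStream(path)), StandardCharsets.UTF_8))) {
                  String line;
                  while ((line = reader.readLine()) != null) {
                      if (line.startsWith("WARC-Target-URI:")) {
                          System.out.println(line.substring("WARC-Target-URI:".length()).trim());
                      }
                  }
              }
          }
      }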
  14. 14. • Data is stored in AWS S3 and exposed to Hadoop through an HDFS-style interface: http://commoncrawl.org/the-data/get-started/ • Can use the usual AWS S3 command line tools for debugging • Newer crawls contain files listing WAT and WET paths CC HDFS Storage 14
  15. 15. • When running Hadoop jobs, HDFS path is supplied to identify all files to process • Pulling down single files and checking them out helps with debugging code • Use AWS S3 command line tool to interact with CC data Accessing HDFS in AWS 15 aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2017-17/wat.paths.gz . aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2017-17/
  16. 16. MapReduce & Hadoop
  17. 17. • Programming model for processing large amounts of data • Processing done in two phases: • Map – take input data and extract what you care about (key-value pairs) • Reduce – apply a simple aggregation function across the mapped data (count, sum, etc) • Easy concept, quirky to get what you need out of it What is MapReduce? 17 https://en.wikipedia.org/wiki/MapReduce
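To make the two phases concrete, here is a toy, single-machine illustration in plain Java (no Hadoop, purely illustrative): the map phase turns each input line into (word, 1) pairs and the reduce phase sums the values that share a key.

      import java.util.AbstractMap.SimpleEntry;
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.LinkedHashMap;
      import java.util.List;
      import java.util.Map;

      // Toy word count showing the map and reduce phases without any framework.
      public class ToyMapReduce {
          public static void main(String[] args) {
              List<String> input = Arrays.asList("to be or not to be", "to map or to reduce");

              // Map phase: emit a (key, value) pair for each thing we care about.
              List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
              for (String line : input) {
                  for (String word : line.split("\\s+")) {
                      mapped.add(new SimpleEntry<>(word, 1));
                  }
              }

              // Reduce phase: aggregate all values that share a key (here, a sum).
              Map<String, Integer> reduced = new LinkedHashMap<>();
              for (Map.Entry<String, Integer> pair : mapped) {
                  reduced.merge(pair.getKey(), pair.getValue(), Integer::sum);
              }
              reduced.forEach((word, count) -> System.out.println(word + "\t" + count));
          }
      }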
  18. 18. • Apache Hadoop • De-facto standard open source implementation of MapReduce • Written in Java • Has an interface to process data in other languages, but writing code in Java comes with perks How ‘bout Hadoop? 18
  19. 19. • Use the Hadoop library for the version you’ll be deploying against • Extend the Configured class and implement the Tool interface • Implement mapper and reducer classes • Configure data types and input/output paths • ??? • Profit Writing Hadoop Code 19
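A minimal skeleton along these lines, showing where the Tool/Configured plumbing, mapper, reducer, types, and paths go. This is an illustrative count-style job, not the talk's HadoopRunner; class and job names are made up.

      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.util.Tool;
      import org.apache.hadoop.util.ToolRunner;

      public class SkeletonJob extends Configured implements Tool {

          public static class SkeletonMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
              private static final LongWritable ONE = new LongWritable(1);
              @Override
              protected void map(LongWritable key, Text value, Context context)
                      throws java.io.IOException, InterruptedException {
                  // Extract whatever you care about from the raw record and emit it as a key.
                  context.write(new Text(value.toString().trim()), ONE);
              }
          }

          public static class SkeletonReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
              @Override
              protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                      throws java.io.IOException, InterruptedException {
                  long sum = 0;
                  for (LongWritable value : values) {
                      sum += value.get();
                  }
                  context.write(key, new LongWritable(sum)); // simple aggregation: a count
              }
          }

          @Override
          public int run(String[] args) throws Exception {
              Job job = Job.getInstance(getConf(), "skeleton-job");
              job.setJarByClass(SkeletonJob.class);
              job.setMapperClass(SkeletonMapper.class);
              job.setReducerClass(SkeletonReducer.class);
              job.setOutputKeyClass(Text.class);           // configure key/value data types
              job.setOutputValueClass(LongWritable.class);
              FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an s3:// input path
              FileOutputFormat.setOutputPath(job, new Path(args[1]));  // job output location
              return job.waitForCompletion(true) ? 0 : 1;
          }

          public static void main(String[] args) throws Exception {
              System.exit(ToolRunner.run(new SkeletonJob(), args));
          }
      }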
  20. 20. • MapReduce supports the map -> reduce paradigm • This is a fairly constrictive paradigm • Have to be creative to determine what to do during both the map and reduce phases to extract and aggregate the data you care about Shoehorning into Hadoop 20
  21. 21. Elastic Map Reduce
  22. 22. • EMR • Amazon’s cloud service for running Hadoop jobs • Usage of all the standard AWS tools • Set up a cluster of EC2 instances to process your data • Free access to data stored in S3 Elastic MapReduce?! 22
  23. 23. • Choose how much you want to pay for EC2 instances • EMR allows you to use spot pricing for your instances • Must have one or two master nodes alive at all times (no spot pricing) • Choose the right spot price and your total cost for processing all of Common Crawl can be <$100.00 Spot Pricing!!! 23
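As a rough worked example (the prices here are assumptions for illustration, not figures from the talk): five spot task nodes at roughly $0.10/hour over the ~48-hour run described later come to about 5 × 48 × $0.10 ≈ $24, and a single on-demand master at roughly $0.30/hour adds around $15, so even with EMR's per-instance surcharge the total stays comfortably under the $100 mark.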
  24. 24. Mining Common Crawl
  25. 25. • We want to find the most common URL paths for every server type • We have access to HTTP request and response headers • We must find a way to map our requirements into the map and reduce phases • Map – Collect/generate the data we care about, fit into key-value pairs • Reduce – Apply a mathematical aggregation across the collected data Here Comes the Shoehorn 25
  26. 26. MAP • Create unique strings that contain (1) a reference to the type of server and (2) the URL path segment, for every URL path segment in every URL found within the CC HTTP responses REDUCE • Count the number of instances of each unique string My Solution 26
  27. 27. • Working with big data requires coercion of input data to expected values • Aggregating on random data == huge output files • For processing CC data, I had to coerce the following values to avoid massive result files • Server headers • GUIDs in URL paths • Integers in URL paths Mapping URL Paths 27
  28. 28. • People put wonky stuff in server headers • Reviewed the contents of a few WAT files and retrieved all server headers • Chose a list of server types to support • Coerce header values into list of supported server types • Not supported -> misc_server • No server header -> null_server Coercing Server Headers 28
  29. 29. • URL paths can contain regularly randomized data • Dates • GUIDs • Integers • Replace URL paths with default strings when • Length exceeds 16 • Contents all integers • Contents majority integers Coercing URL Paths 29
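A sketch of what those two coercion steps could look like in Java. The misc_server/null_server buckets and the length-over-16 and integer rules come from the slides; the supported-server list, placeholder strings, and GUID pattern are illustrative assumptions, not the project's actual values.

      import java.util.Arrays;
      import java.util.List;
      import java.util.Locale;

      public class Coercion {
          // Truncated, illustrative list of supported server buckets (assumption).
          private static final List<String> SUPPORTED_SERVERS =
                  Arrays.asList("apache", "nginx", "iis", "lighttpd", "jetty");

          // Coerce a raw Server header into a small, fixed set of bucket names.
          public static String coerceServerHeader(String rawHeader) {
              if (rawHeader == null || rawHeader.trim().isEmpty()) {
                  return "null_server";
              }
              String lowered = rawHeader.toLowerCase(Locale.ROOT);
              for (String server : SUPPORTED_SERVERS) {
                  if (lowered.contains(server)) {
                      return server;
                  }
              }
              return "misc_server";
          }

          // Replace path segments that look randomized (GUID-like, long, numeric)
          // with fixed placeholder strings so they aggregate instead of exploding.
          public static String coercePathSegment(String segment) {
              if (segment.matches("[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}")) {
                  return "<guid>";          // placeholder names are assumptions
              }
              if (segment.length() > 16) {
                  return "<long>";
              }
              long digits = segment.chars().filter(Character::isDigit).count();
              if (digits > 0 && digits == segment.length()) {
                  return "<int>";
              }
              if (digits > segment.length() / 2) {
                  return "<mostly_int>";
              }
              return segment;
          }
      }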
  30. 30. Mapping Result Key 30: the mapping process produces strings containing the coerced server header and URL path, e.g. < 02_';)_apache_generic_';)_AthenaCarey >, composed of (1) the record type (02), (2) the server type (apache_generic), and (3) the URL path segment (AthenaCarey)
  31. 31. Mapping Example 31

      GET /foo/bar/baz.html?asd=123 HTTP/1.1
      Host: www.woot.com
      User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0
      Accept: text/html
      Accept-Language: en-US,en;q=0.5
      Accept-Encoding: gzip, deflate
      Server: Apache/2.4.9 (Unix)
      Connection: close
      Upgrade-Insecure-Requests: 1

      /foo/bar/baz.html on Apache (Unix):

      < 02_';)_apache_unix_';)_/foo/>, 1
      < 02_';)_apache_unix_';)_/bar/>, 1
      < 02_';)_apache_unix_';)_baz.html>, 1
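Putting the pieces together, the map side could look roughly like this. It is a sketch, not the project's actual mapper: it reuses the hypothetical Coercion helpers above, assumes the target URI and Server header have already been pulled out of the WAT JSON into a tab-separated line, and does not reproduce the exact segment formatting shown in the example. Counting is then done by a summing reducer exactly like the earlier skeleton.

      import java.io.IOException;
      import java.net.URI;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      // Emit one key per (server type, URL path segment) pair, counted by a summing reducer.
      public class PathSegmentMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

          private static final LongWritable ONE = new LongWritable(1);
          private static final String SEP = "_';)_";       // separator from the slides' key format
          private static final String RECORD_TYPE = "02";  // record-type prefix from the slides

          // Assumes each input line is already "targetUri<TAB>serverHeader";
          // the real job would parse these out of the WAT JSON for each record.
          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String[] parts = value.toString().split("\t", 2);
              if (parts.length != 2) {
                  return;
              }
              String server = Coercion.coerceServerHeader(parts[1]);
              String path = URI.create(parts[0]).getPath();
              if (path == null) {
                  return;
              }
              for (String segment : path.split("/")) {
                  if (segment.isEmpty()) {
                      continue;
                  }
                  String coerced = Coercion.coercePathSegment(segment);
                  context.write(new Text(RECORD_TYPE + SEP + server + SEP + coerced), ONE);
              }
          }
      }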
  32. 32. • Swap out the fileInputPath and fileOutputPath values in HadoopRunner.java • Compile using ant (not Eclipse, unless you really like tearing your hair out) • Upload Hadoop JAR file to AWS S3 • Create EMR cluster • Add a “step” to EMR cluster referencing the JAR file in AWS S3 Running in EMR 32
  33. 33. • Processing took about two days using five medium-powered EC2 instances as task nodes • 93,914,151 results (mapped string combined with # of occurrences) • ~3.6GB across 14 files • Still fairly raw data – we need to process it for it to be useful Resulting Data 33
  34. 34. • We effectively have tuples of server types, URL path segments, and the number of occurrences for each server type and segment pair • Must process the results and order by most common path segments • Parsing code can be found here: Parsing the Results 34 https://github.com/lavalamp-/lava-hadoop-processing
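A sketch of that post-processing step (not the linked project's code): it assumes reducer output lines of a tab-separated key and count, using the key format above, and prints each server type's ten most common segments.

      import java.io.IOException;
      import java.nio.charset.StandardCharsets;
      import java.nio.file.Files;
      import java.nio.file.Paths;
      import java.util.ArrayList;
      import java.util.Comparator;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      // Group reducer output by server type and sort each server's URL path
      // segments by how often they were seen, most common first.
      public class ResultSorter {
          public static void main(String[] args) throws IOException {
              Map<String, List<String[]>> byServer = new HashMap<>();
              for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
                  String[] keyAndCount = line.split("\t");
                  if (keyAndCount.length != 2) {
                      continue;
                  }
                  String[] keyParts = keyAndCount[0].split("_';\\)_"); // record type, server, segment
                  if (keyParts.length != 3) {
                      continue;
                  }
                  byServer.computeIfAbsent(keyParts[1], k -> new ArrayList<>())
                          .add(new String[]{keyParts[2], keyAndCount[1]});
              }
              for (Map.Entry<String, List<String[]>> entry : byServer.entrySet()) {
                  entry.getValue().sort(
                          Comparator.comparingLong((String[] t) -> Long.parseLong(t[1])).reversed());
                  System.out.println("== " + entry.getKey() + " ==");
                  entry.getValue().stream().limit(10)
                          .forEach(t -> System.out.println(t[0] + "\t" + t[1]));
              }
          }
      }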
  35. 35. Data Mining Results
  36. 36. URL Segment Counts 36 [Bar chart: "# of URL Segments by Server Type"; x-axis: # of discovered URL segments (log scale, 500,000 to 500,000,000); y-axis: server type, covering Gunicorn, Thin, Openresty, Zope, Lotus Domino, Sun Web Server, Apache (Windows), Jetty, PWS, Lighttpd, IBM HTTP Server, Resin, Oracle Application Server, Litespeed, Miscellaneous, IIS, Nginx, Apache (Unix), Apache (Generic)]
  37. 37. Coverage by # of Requests 37 (requests needed to reach each level of coverage, per server type)

      Server Type                  50%    75%    90%    95%    99%  99.70%  99.90%
      Apache (Generic)              58    217    475    611    749     776     784
      Apache (Unix)                 53    189    395    502    604     624     629
      Apache (Windows)              14     41     78     97    117     121     122
      Gunicorn                       2      4      5      6      6       6       6
      IBM HTTP Server                6     15     21     24     26      26      26
      IIS                          103    330    610    738    859     882     889
      Jetty                          4     10     15     17     19      19      19
      Lighttpd                      20     76    178    240    306     320     324
      Litespeed                     16     43     73     90    109     113     114
      Lotus Domino                   3      5      6      7      7       7       7
      Miscellaneous                 93    329    687    907   1147    1196    1210
      Nginx                         87    341    760   1005   1284    1343    1360
      Openresty                      7     31     97    159    271     306     318
      Oracle Application Server      1      4      6      6      7       7       7
      PWS                            6     15     22     25     28      29      29
      Resin                          1      5      9     10     12      12      12
      Sun Web Server                 6     11     14     16     17      17      17
      Thin                           3      6     10     11     12      13      13
      Zope                          12     25     37     42     47      47      48
  38. 38. Most Common URL Segments 38

      Apache (Unix):    index.php, /forum/, /forums/, /news/, viewtopic.php, showthread.php, /tag/, /index.php/, newreply.php, /cgi-bin/
      Apache (Windows): index.php, index.cfm, /uhtbin/, /cgisirsi.exe/, /NCLD/, /catalog/, modules.php, /events/, /forum/, /item/
      Apache (Generic): /news/, index.php, /wiki/, /forums/, /forum/, /tag/, /search/, showthread.php, viewtopic.php, /en/
      IIS:              /article/, /news/, /page/, /id/, default.aspx, /products/, /NEWS/, /en/, /apps/, /search/
      Nginx:            /tag/, /news/, /forums/, /forum/, index.php, /tags/, showthread.php, /page/, /category/, /articles/
  39. 39. Comparison w/ Other Sources 39

      FuzzDB (all)                 850,425   +99.8%
      FuzzDB (web & app server)      7,234   +81.2%
      Dirs3arch                      5,992   +77.3%
      Dirbuster                    105,847   +98.7%
      Burp Suite                   424,203   +99.7%

      91.34% average improvement upon existing technologies
      *No other approaches provide coverage guarantees
  40. 40. • Common Crawl respects (I believe) robots.txt • Certainly has a number of blind spots • Results omit highly-repetitive URL segments (integers, GUIDs) • Crawling likely misses plenty of JavaScript-based URLs • Lots of juicy files are never linked, therefore missed by Common Crawl Caveats 40
  41. 41. Resulting hit list files can be found in the following repository: https://goo.gl/lxdPDm Getchu Some Data 41
  42. 42. Big(ish) Data Sources
  43. 43. • Public archive of research data collected through active scans of the Internet • Lots of references to other projects containing data about • DNS • Port scans • Web crawls • SSL certificates Scans.io 43 https://scans.io/
  44. 44. • American Registry for Internet Numbers • WHOIS records for a significant amount of the IPv4 address space • Other regional registries have similar services • ARIN • AFRINIC • APNIC • LACNIC • RIPE NCC ARIN 44 https://www.arin.net/
  45. 45. • Awesome open source tools for performing Internet-scale data collection • Zmap – network scans • Zgrab – banner grabbing & network service interaction • ZDNS – DNS lookups Zmap 45 https://zmap.io/
  46. 46. • Use SQL syntax to search all sorts of huge datasets • One public dataset contains all public GitHub data… Google BigQuery 46 https://cloud.google.com/bigquery/
  47. 47. Google BigQuery Tastiness 47

      SELECT count(*) FROM [bigquery-public-data:github_repos.files] AS BQFILES
      WHERE BQFILES.path LIKE '%server.pem' OR BQFILES.path LIKE '%id_rsa' OR BQFILES.path LIKE '%id_dsa';
      13,706

      SELECT count(*) FROM [bigquery-public-data:github_repos.files] AS BQFILES
      WHERE BQFILES.path LIKE '%.aws/credentials';
      42

      SELECT count(*) FROM [bigquery-public-data:github_repos.files] AS BQFILES
      WHERE BQFILES.path LIKE '%.keystore';
      14,558

      SELECT count(*) FROM [bigquery-public-data:github_repos.files] AS BQFILES
      WHERE BQFILES.path LIKE '%robots.txt';
      197,694
  48. 48. Conclusion
  49. 49. • MapReduce • Hadoop • Amazon Elastic MapReduce • Common Crawl • Shoehorning problem sets into MapReduce • Benefits from using big data • Additional data sources Recap 49
  50. 50. • Hone content discovery based on already-found URL paths • Generate content discovery hit lists for specific user agents (mobile vs. desktop) • Hone network service scanning based on already-found service ports Future Work 50
  51. 51. • Common Crawl Hadoop Project https://github.com/lavalamp-/LavaHadoopCrawlAnalysis • Common Crawl Results Processing Project https://github.com/lavalamp-/lava-hadoop-processing • Content Discovery Hit Lists https://github.com/lavalamp-/content-discovery-hit-lists • Lavalamp’s Blog https://l.avala.mp/ References 51
  52. 52. THANK YOU! @_lavalamp chris [AT] websight [DOT] io https://github.com/lavalamp- https://l.avala.mp
