Security is at the core of every bank's activity. ING set itself an ambitious goal: to gain insight into overall network activity. The purpose is to quickly recognize and neutralize unwelcome guests such as malware and viruses, to prevent data leakage, and to track down misconfigured software components.
Since the inception of the CoreIntel project we knew we were going to face the challenges of capturing, storing and processing vast amounts of data of various types from all over the world. In our session we would like to share our experience in building a scalable, distributed system architecture based on Kafka, Spark Streaming, Hadoop and Elasticsearch to help us achieve these goals.
Why does choosing a good data format matter? How do you manage Kafka offsets? Why is Elasticsearch a love-hate relationship for us? And how did we manage to put all these pieces together?
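The Kafka offset question comes down to one pattern: commit offsets together with the processed output instead of relying on auto-commit. A minimal sketch of that at-least-once pattern in plain Python, with dicts standing in for Kafka and the offset store (no real Kafka client is used, and the record values are invented):

```python
# Sketch of at-least-once Kafka offset handling: offsets are persisted
# atomically with the processed output, never auto-committed.
# The "broker" and "store" here are plain dicts standing in for Kafka
# and a real offset store.

def poll(broker, partition, offset, max_records=10):
    """Return (offset, record) pairs after `offset` for one partition (simulated)."""
    records = broker[partition][offset:offset + max_records]
    return list(enumerate(records, start=offset))

def process_partition(broker, store, partition):
    """Resume from the last committed offset and process new records."""
    offset = store.get(("offset", partition), 0)
    for off, record in poll(broker, partition, offset):
        result = record.upper()            # stand-in for real processing
        # Commit the result and the next offset in one step, so a crash
        # between them cannot drop or silently skip records.
        store[("out", partition, off)] = result
        store[("offset", partition)] = off + 1

broker = {0: ["dns-log-1", "dns-log-2", "dns-log-3"]}
store = {}
process_partition(broker, store, 0)
process_partition(broker, store, 0)     # resumes at the committed offset: no-op
```

Because the offset only advances after the output is stored, a crash mid-batch means at worst re-processing a record, never losing one.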
4. CoreIntel is part of the ING Cybercrime Resilience Programme
to structurally improve the capabilities for cybercrime
• prevention
• detection and
• response
CoreIntel
5. • Measures against e-banking fraud, DDoS and Advanced Persistent Threats (APTs).
• Threat intelligence allows us to respond to, or even prevent, a cybercrime attack
• (This kind of intelligence is available via internal and external parties and includes both
open and closed communities)
• Monitoring, detection and response to “spear phishing”
• Detection/mitigation of infected ING systems
• Baselining network traffic/anomaly detection
• Response to incidents (knowledge, tools, IT environment)
• Automated feeds, automated analysis and historical data analysis
The reasoning
11. • What kind of data do we need?
• Where is our data located?
• How can we potentially capture it?
• What are the legal implications?
So the challenge is to capture "all" the data
26. In memory data grid
// The Hazelcast Spark connector (assumed to be on the classpath) adds
// fromHazelcastMap to the SparkContext, loading a distributed map as an RDD
import com.hazelcast.spark.connector._
val rddFromMap = sc.fromHazelcastMap("map-name-to-be-loaded")
27. Let’s find something in these logs
Photo credit: https://www.flickr.com/photos/65363769@N08/12726065645/in/pool-555784@N20/
28. Matching
Tornado - a Python web framework and asynchronous
networking library - http://www.tornadoweb.org/
MessagePack – binary transport format
http://msgpack.org/
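MessagePack itself is not shown here, but the reason a compact binary transport beats plain JSON for high-volume log shipping can be illustrated with the stdlib `struct` module. The field layout below is a made-up example, not CoreIntel's actual wire format:

```python
# A fixed-layout network-log record packed as binary vs. its JSON rendering.
# The layout is a made-up example: 4-byte source IP, 4-byte destination IP,
# unsigned short port, unsigned int timestamp.
import json
import struct

RECORD = struct.Struct("!4s4sHI")   # network byte order, no padding

src = bytes([10, 0, 0, 1])
dst = bytes([192, 168, 1, 7])
binary = RECORD.pack(src, dst, 443, 1_450_000_000)

as_json = json.dumps(
    {"src": "10.0.0.1", "dst": "192.168.1.7", "port": 443, "ts": 1_450_000_000}
).encode()

print(len(binary), len(as_json))   # the binary record is several times smaller
```

Multiplied by billions of records per day, that size difference is what makes a binary transport format worth the extra tooling.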
29. • Automatically & continually match network logs <-> threat intel
• When new threat intel arrives, match it against the full history of network logs
• When new network logs arrive, match them against the full history of threat intel
• Alerts are shown in a hit dashboard
• The dashboard is a web-based interface that provides flexible charts, querying, aggregation
and browsing
• Quality/relevance of an alert is subject to the quality of IoC feeds and the completeness of
internal log data.
Hits, alerts and dashboards
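The bidirectional matching above can be sketched with two plain sets. The Kafka/Spark plumbing and real IoC feeds are stripped away, and the indicator value is an invented example:

```python
# Sketch of bidirectional matching: every new IoC is checked against the
# full log history, and every new log against the full IoC set, so a hit
# is found no matter which side arrives first.

seen_logs = set()   # full history of indicators observed in network logs
iocs = set()        # full threat-intel history (indicators of compromise)
hits = []           # what would feed the hit dashboard

def on_network_log(indicator):
    seen_logs.add(indicator)
    if indicator in iocs:              # new log vs. full intel history
        hits.append(("log-hit", indicator))

def on_threat_intel(indicator):
    iocs.add(indicator)
    if indicator in seen_logs:         # new intel vs. full log history
        hits.append(("intel-hit", indicator))

on_network_log("evil.example.com")     # no intel yet -> no hit
on_threat_intel("evil.example.com")    # matches the historical log
on_network_log("evil.example.com")     # matches the existing intel
```

In production the "full history" sides live in Hadoop/Elasticsearch rather than in-memory sets, but the symmetry of the two checks is the whole idea.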
30. Be smart with your tooling
Photo credit https://www.flickr.com/photos/12749546@N07/
36. Core Intel allows users to perform advanced analytics on network logs using a set of
powerful tools
• Spark API to write code to process large data sets on a cluster
• perform complex aggregations to collect interesting statistics
• run large scale clustering algorithms with Spark’s MLlib
• run graph analyses on network logs using Spark’s GraphX
• transform and extract data for use in other systems (which are better suited to specific analytics or
visualization purposes)
• Kafka, so you can write your own Consumers and Producers to work with your data
• to perform streaming analysis on your data
• to implement your own alerting logic
• Toolset
• Programming languages: Scala, Java, Python
• IDEs: Eclipse / Scala IDE, IPython Notebook and RStudio
Advanced analytics
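As a toy illustration of the "complex aggregations" bullet above, here is the idea of baselining traffic per destination in plain Python. A production job would express this as a Spark aggregation over the cluster, and the log records here are invented examples:

```python
# Toy baselining aggregation: count connections per destination and flag
# destinations seen only once (a crude "never seen before" baseline).
from collections import Counter

logs = [
    ("10.0.0.1", "dns.internal"), ("10.0.0.2", "dns.internal"),
    ("10.0.0.3", "dns.internal"), ("10.0.0.1", "dns.internal"),
    ("10.0.0.1", "rare.external.example"),
]

per_dest = Counter(dst for _src, dst in logs)          # destination -> hit count
rare = [dst for dst, n in per_dest.items() if n == 1]  # candidates for review
print(per_dest.most_common(2), rare)
```

The same shape of computation (group, count, compare against a baseline) is what the Spark API bullet points run at cluster scale.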