The modern web-scale network is a complicated place. Modern systems-management techniques have made it trivial to create, destroy, and repurpose any number of instance types. These instances span the range from bare-metal machines sitting in a datacenter, to on-demand third-party virtual machines, to the containers and microservices that now seem to be all the rage. Instances are cattle; they are no longer pets. All of this perpetual churn and flexibility is exactly what you want in a constantly changing, highly available, and efficient infrastructure. The ability to create or destroy nodes on demand, or to continuously and automatically scale up, scale down, and redeploy applications as part of a continuous integration pipeline, has become a necessary and integral part of daily operations. However, these systems can generate terabytes of network logs a day. And if your job is detecting, correlating, and alerting on the correct anomaly in all that data, the analogy of the needle in the haystack really doesn’t do it justice; something closer would be finding a needle in a windstorm. How do you begin to collect, store, analyze, and alert on this much data without costing the company a small fortune? What are some practical steps you can take to reduce your overall risk and begin to gain more insight, visibility, and confidence in what is actually taking place on your network? This talk aims to give the attendee a solid understanding of the problem space, as well as recommendations and practical advice from someone who built their own ‘big data’ network and security monitor. It really is easier than it sounds.
Questions
What goals am I trying to accomplish?
What are the sources of truth?
What tools would work best?
What is an anomaly?
Am I correlating the alerts?
What about user experience?
Is the system robust and secure?
What else can I do with all the data?
name: travis carelock
twitter: @l3d
email: travis@soundcloud.com
pgp: 463E B548 F3B1 F879 4589 6505 E417 7480 D1A4 A990
private: travis@carelock.net
pgp: 4CFC 8E69 4A07 59F2 4508 8A39 0AFA 9CC3 2D65 031E
otr: l3d@dukgo.com
fingerprint: 40FCAFD7 FAA097B6 29BE95CE 6740E37E 0790E295
is hiring!
Web: http://soundcloud.com/jobs
Email: jobs@soundcloud.com
Thank You!
Special Thank You to Code Blue and the Organisers!
Editor’s Notes
Let’s start. Hello everyone, and thank you for coming to my presentation. My name is Travis Carelock and this is Practical Network Defence at Scale, or Protecting the Eierlegende Wollmilchsau. (….) First of all I would like to thank Code Blue for giving me the opportunity to speak to you. I am very honoured and hope that everyone here finds this presentation useful. (…) There is a lot of material to cover, so this presentation may move quickly. (…)
One more thing before I start. I will be focusing on network defence, but all your logs have a wealth of information. Please do not get too focused on the implementation details, instead try to find inspiration, and think about how you can apply some of these techniques to your own organizations. (…)
Who am I? As I said, my name is Travis Carelock, and I currently work as an Engineer on the Security Team at SoundCloud in Berlin. (…) In the past I have worked for Black Hat and the Louisiana Department of Justice.
Why did I want to talk on this subject? Well first, I love defence. (…) I always have. Some people love to attack, to figure out a weakness, exploit it, and move on to the next target. (…) I enjoy understanding my environment, both the strengths and the weaknesses. I enjoy raising the walls, laying traps, and keeping watch from the tower. (…) I also enjoy the adversarial nature of it. There are real humans out there attacking. (…) So you have to adjust and stay on your toes.
And. I love avoiding this. We also get paid to avoid this.
Focus on network defense. A custom scalable SIEM. LOTS of logs for your own use.
So I hope to show with this talk some real, concrete steps you can take to start building up your defences and give yourself some peace of mind. (…)
I thought about the best way to present this. (…) Due to the nature of defense, log analysis, and anomaly detection, there is really no one simple answer. No one simple button to press. Each infrastructure must be analysed.(…) An effective security monitor can only be created and tuned by understanding the environment in which it operates.(…)
Therefore, I thought it was sensible to present a series of questions that one should answer before beginning to build a network monitoring solution.
The first question. What am I trying to accomplish? (…) This question seems basic but the answer shapes everything to come after it. So it is super important to get right.(….) From a security point of view, the question is also deceptive.
People outside the organization might say something like: “What are we trying to accomplish?! We are trying to be secure and not get hacked! (…) You’ve seen all the hackers out there, right!? STOP THEM!”
We know that is not realistic. Sure, there might be people trying to exploit the infrastructure in one manner or another. (…) But fearing some amorphous, evil force who is “smarter” than you, better equipped, and has WAY more coffee and time than you do is not helpful and won’t accomplish anything.
Good security can only come from a realistic assessment of the environment.
Then use a systematic and sustained effort to mitigate the highest risks as best as possible. Periodically repeat the risk assessment to ensure the organization is still spending resources in the necessary places. It can be a slow process, but it is the only way to make real defensive progress. (…)
So it is required to have a detailed view of the environment. This must come before we can answer the question “What are we trying to accomplish?” (…) Funnily enough, we will use the example of a “fast-moving” start-up.
A bit of a disclaimer here. I will try to make this example as generic as possible. After all, I still want to keep some secrets. (…)
The example environment consists of large, flat networks with thousands of nodes. When we say it is flat, that means nodes can “see” each other on the network without segmentation. It is a giant beehive of activity. These nodes serve billions of daily user actions. (…) That means billions of times a day users are logging in, playing songs, uploading tracks, messaging and commenting with their friends, or in some way interacting with this network. (…) Most of these events trigger fan-out type connections to multiple services as each user’s request is fulfilled. Now let’s focus in on what any one of those nodes could be.
When we look at a node in this environment, it is incredibly dynamic. At any given time it can take on any number of tasks and configurations.
For example. A node could be deployed on:(…) A physical machine or machines in a datacentre.(…)
Virtual Machines or a container (…) Infrastructure Equipment such as routers and switches (…)
Cloud Provided Assets.(…) Or even temporary nodes that represent VPN users logging in(…)
And these nodes can be serving a variety of services or roles.
Some could be running a very important and complex application or microservice.(…) Some could be a datasource. (..)
It might be serving as part of a cluster, even a data source cluster (…)
It could be providing internal services such as DNS, DHCP, or running tests(…)
Many people have load-balancers helping to spread out traffic(…)
The node might be a corporate user querying for business intelligence data(…)
It could be the Engineers scaling nodes and redeploying services. Or just poking around(…)
And some might even be part of a security system.(…)
You also have things to look out for… namely zombie machines or malicious entities. (…)
I would like to speak for a moment about the way modern infrastructure has changed how administrators and engineers view and manage these nodes.
Some of this has been said in a few talks before, but I would like to expand on the concept a bit.
Basically, in the past, because nodes were generally so hard to rebuild, we treated them as pets. We took care of them. And prided ourselves on the uptime counter. If our servers needed ANYTHING, ANYTHING at all… we did it. Does the 200 Kilo server need to be moved up 3 flights of stairs to a brand new air conditioned closet? DON’T TURN IT OFF!!! Ask 3 interns to help carefully carry the 40 Kilo backup battery alongside. It was ridiculous. As scaled environments started to become more normal, a better way of managing servers was needed… DevOps is the broad name for this, and it really concerns itself with the management of systems with repeatable code. Continuous Integration is closely tied, as it works to automate the deploy-to-production pipeline. As a result of all this automation…
We now treat nodes as cattle. Did something get corrupted? Some physical failure? Did we rm the wrong thing? No problem, make sure the load balancer handles the load, rack a replacement, and turn it on. But this analogy is not quite correct, because cattle can only do certain things: give milk and meat, maybe pull your wagon. (…) But the modern system engineer’s dream is to view every node as an eierlegende Wollmilchsau.
Eierlegende Wollmilchsau is a German term; literally, it is an “egg-laying, wool-giving, milk-giving pig”. It is a pretend farm beast that provides everything one might need. Engineers want nodes like this, nodes that can do anything. One moment a node might be part of a database cluster. (…) Then an hour later, the physical machine that the node was on is wiped, the node is destroyed, and instead two different application nodes take its place on the hardware. (…) The original database node is then deployed to an entirely new IP, behind a load-balancer and hosted in a Virtual Private Cloud. (…)
Or possibly during peak times 50 application nodes are automatically scaled to 150. And then after peak, the excess nodes are all destroyed. (…) These are very typical daily activities. SIEMs have not been able to keep up with this. Many were built for the typical environment of three to five years ago…
So…. Any node can be anything at anytime.
This looks a bit scarier now….
But don’t worry! One step at a time. Just start with a simple goal and expand from there.
Let’s get back to what our goals are. Now that we have an example environment, some definite network security goals really begin to surface. (…) You can really go crazy here, but remember small bites, and attainable results will both help your security and your sanity. (…)Here are three very reasonable goals that a system like the one we are designing should accomplish. I will go through each one for clarity.
The first goal would be to simply investigate network traffic between nodes, or between logical collections of nodes. (…) A collection of nodes could be something like a database cluster, container group, a scaled application, a micro-service, etc. (….)
For any given logical grouping of nodes used by the Systems Engineers I would like to be able to investigate connections to any other logical grouping of nodes.(…) From a layer 3 point of view, I would like to know IPs, Ports, and data transfer.
Next, I would like to write rules around this traffic, and then be alerted when these rules are violated. (…) For example, I would want to allow a node or collection of nodes to connect to a database cluster, but I would like to be alerted if any other nodes in the network attempt to connect via 3306. (…) Or maybe I would like to be alerted when any database node makes a connection to any IP on the internet, or any IP outside of its allowed range. The rule possibilities are literally endless with the right query system.
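A rule like the 3306 example can be sketched in just a few lines. This is only an illustration; the subnets and the allowed source range below are made-up example values, not the real environment’s:

```python
from ipaddress import ip_address, ip_network

# Hypothetical rule: only the app tier may reach the database cluster on 3306.
ALLOWED_SRC = ip_network("10.1.0.0/16")   # example app-node range (assumption)
DB_CLUSTER = ip_network("10.2.0.0/24")    # example database range (assumption)
DB_PORT = 3306

def check_flow(src_ip, dst_ip, dst_port):
    """Return an alert string if this flow violates the rule, else None."""
    src, dst = ip_address(src_ip), ip_address(dst_ip)
    if dst in DB_CLUSTER and dst_port == DB_PORT and src not in ALLOWED_SRC:
        return "ALERT: unexpected connection %s -> %s:%d" % (src_ip, dst_ip, dst_port)
    return None

print(check_flow("10.9.9.9", "10.2.0.5", 3306))  # unexpected source: alert fires
print(check_flow("10.1.4.2", "10.2.0.5", 3306))  # allowed app node: None
```

In practice the allowed ranges would come from the query system rather than being hard-coded, but the shape of the check is the same.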
Finally we would like to be able to store this data for a determined amount of time, and perform various forms of analysis. And if necessary, provide forensic evidence after an alert has been triggered in order to assess the extent of damage or further compromise. (…) This could be very important.
Now that we have an idea of what we want to accomplish. We can move forward to the next question. What are the Sources of Truth?
What contains the information I need to answer the questions and accomplish the goals?(…) What data do I collect and analyse? If the current data doesn’t exist, can I build something that will produce it? (…) Looking back at our example. Can we find sources of truth emitted by systems in that infrastructure that will help to create a network monitor? YES!
These are just some of the examples that could exist in your network. I will tell you a secret: to monitor network traffic we are going to rely heavily on traffic flow logs generated by the switching and routing infrastructure. (…) These will give a very useful and independent view of the network from a layer 3 perspective. (…) Obviously firewall and intrusion detection system logs would be important, but I encourage you to expand your log collection as far as possible. (…) Think about logs from infrastructure services such as DNS/DHCP, host-based logs, application logs, database logs, Amazon CloudTrail, S3 logs. And sometimes we even want to use code to create small services that emit speciality logs.
Also you will want to understand the nature of the data you are collecting.
This is a critical step. (…) Ask questions like: how consistent is this data? Does it arrive erratically? Does it measure what it says it measures? Is it independent? Could it have been corrupted in some way? (…) How reliable is it? This will allow you to give it a confidence score as it relates to any given investigation or query. For example, suppose a host node is suspected of being compromised. (…) The auth.log says no one has logged in; however, the sFlow logs from the connected switch clearly show lots of SSH traffic. One of those logs would have a much lower confidence score than the other. (…) Finally, keep in mind the retention policy of this data. How long does it, or should it, stick around? That all depends on the risk profile associated with that data. Some things you just want to get rid of.
Eventually you will want to search for similar items across all logs. If you just blindly dump all the logs into a giant vat, they will end up only being useful in reference to other logs of the same type.(…)
To get the most out of the logs, the first step is to normalize. (…)
You will notice here that all three of these logs do display a timestamp… And that is necessary, but notice that they are each in slightly different formats. (…) You will want to normalize all these into a single standard timestamp format during pre-processing. Luckily, Logstash uses JRuby to modify log lines on the fly.
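As a rough illustration of what that normalization step does (independent of Logstash itself), here is a small Python sketch. The three formats and the sample value are assumptions for the example, not taken from any specific product:

```python
from datetime import datetime, timezone

# Three timestamp styles you might meet across different log sources.
FORMATS = [
    "%b %d %H:%M:%S",           # syslog style, e.g. "Oct 23 14:02:11" (no year, no zone)
    "%Y-%m-%dT%H:%M:%S%z",      # ISO 8601 with a numeric offset
    "%d/%b/%Y:%H:%M:%S %z",     # Apache/nginx access-log style
]

def normalize_ts(raw):
    """Parse any known format and emit a single UTC ISO 8601 string."""
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if dt.tzinfo is None:  # syslog lines carry no year or zone; assume this year, UTC
            dt = dt.replace(year=datetime.now().year, tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc).isoformat()
    raise ValueError("unrecognised timestamp: %r" % raw)

print(normalize_ts("2014-10-23T14:02:11+0200"))  # -> 2014-10-23T12:02:11+00:00
```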
Here we see data transfer. Is it in bytes, bits, mega-bytes? Again normalize to a standard.
If you tag the logs during pre-processing, then searching for similar fields across all the logs is possible. (..) Take the time to chop up and GROK your log files; it pays off in the end. (…) Here we are tagging this as src_ip, no matter what the log file calls it.
Here we would apply a “dst_ip” tag.
And finally, TYPE the individual fields as they are tagged. Not everything is a string. (…) For example, anything that is tagged as a src_port is also an Integer. (…) This will allow you to perform calculations based on the variable’s type. So that means counts, addition, ranges.
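Putting the tagging and typing together, a minimal pre-processing sketch might look like this. The log-line format, the regex, and the field names are invented for illustration:

```python
import re

# Hypothetical flow-log line format; real GROK patterns live in a pattern library.
FLOW_RE = re.compile(
    r"(?P<src_ip>\d+\.\d+\.\d+\.\d+):(?P<src_port>\d+) -> "
    r"(?P<dst_ip>\d+\.\d+\.\d+\.\d+):(?P<dst_port>\d+) bytes=(?P<bytes>\d+)"
)

def parse_flow(line):
    """Extract tagged fields from a flow-log line and type them."""
    m = FLOW_RE.search(line)
    if not m:
        return {}
    event = m.groupdict()
    # Type the fields: ports and byte counts are integers, not strings.
    for key in ("src_port", "dst_port", "bytes"):
        event[key] = int(event[key])
    return event

event = parse_flow("10.1.4.2:53122 -> 10.2.0.5:3306 bytes=1843")
print(event["src_port"] + 1)   # typed as int, so arithmetic works: 53123
```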
So now you can answer questions like: for a given src_ip, what did it connect to, and how many bytes were transferred? How much of a change relative to yesterday at the same time period?
For “IP”-typed fields these calculations include IP ranges, which makes a full IP subnet range a valid query. This would not be possible if everything was just left as the default “string” type. (…)
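Here is the same idea expressed in plain Python using the standard library’s ipaddress module; the subnet and the events are example values:

```python
from ipaddress import ip_address, ip_network

# Because src_ip is an IP type rather than a plain string, subnet-range
# queries become possible. Example subnet and events below are made up.
db_subnet = ip_network("10.2.0.0/24")

events = [
    {"src_ip": "10.2.0.17", "bytes": 1843},
    {"src_ip": "192.0.2.44", "bytes": 99},
]

# "All traffic whose source is inside the database subnet" as a range query.
in_range = [e for e in events if ip_address(e["src_ip"]) in db_subnet]
total = sum(e["bytes"] for e in in_range)
print(len(in_range), total)   # 1 matching event, 1843 bytes
```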
Again, all this can be done within Logstash. But there are any number of different open-source libraries you could use to interact with Elasticsearch and basically ship JSON.
Obviously I’ve been speaking about chopping up logs, and that leads us to our next question. What tools are we going to use?
As you can already see, collecting, indexing, and asking questions of logs will be our primary way of accomplishing our goal. So, what are you going to use to accomplish the tasks? (…) Due to the fact that we are working with logs, the primary tool I would like to use is Elasticsearch. It is a data store and the engine that drives the ELK stack. (…) The ELK stack is a modular set of tools with some very complementary features: Elasticsearch, Logstash, and Kibana. (…) Logstash helps to ingest, modify, and tag logs before shipping them to the Elasticsearch data store. And Kibana is a great web visualization tool.
As you can see, Kibana makes for pretty dashboards and includes interesting features like maps.
And I do need to stress here: Elasticsearch is great… Just to give you an example, I am continually ingesting and indexing 35K log lines per second, which generates about 1.5 TB daily. (…)
For this system, this is ES. This is me
As you bring data back to Logstash from these various sources, some of the underlying tools may need tweaking and non-standard configs in order to keep up with the scale. (..) And at some point you will want to write some of your own code to make your life easier, and sometimes just to get the job done. (…) My advice: no matter what the language, finding well-supported libraries is key. Many people face similar issues and there is no need to reinvent the wheel. (…)
And I cannot stress this enough when building these high-scale systems:
Now you have the goal. Investigate network traffic and setup rules, and alerts. You have the sources of truth, and the tools with which to analyse them. Now you have to ask What is the Target?(…) You can’t just say, “Ok computer, Show me everything Bad.” Security is accomplished by systematically reducing the risk and increasing visibility.(…) Start with a narrow scope and work out from there. You can find the best narrow targets by performing a general risk analysis. Find out what is most important to the business. A particular database cluster or all the database clusters might be an example.(…) But if we were not focused on network security with this example, the focus could just as easily target something like AWS Console, API activity, and S3 bucket access, or user access in the production environment. This all depends on the organization.
First in order to separate malicious traffic from normal traffic, we need to know what normal traffic is. (…) Who should be connecting to the database? We need to understand the logical side of the network.
In the modern scaled network, system engineers don’t create every machine or container by hand. So there must be some set of systems that have a high-level understanding of nodes, and their deployment configurations.(…) In addition, most services or micro-services deployed have their own set of dependency services that must be known about and discoverable. (…) Find these systems and interface with them. They understand the world.
Some of the more popular system management systems out there are Chef, Puppet, Ansible, and CFEngine. These systems are the backbone of most DevOps infrastructures. Query their data. (…)
In addition, most of these environments also have some method for applications to automatically discover their service dependencies. This could be something like DNS or other service discovery tools. (…)
The cloud services also have their own APIs that you can query to get a variety of information about an instance and its tags. (…)
Source code is an excellent place to look to understand how services connect to one another. If there is a huge variety of code types and config files, a standardized, machine-readable info file could be added to the root directory of the project. (…) And there is no getting around it: for some things, a small amount of code will surface the information you need.
The main point is to understand what powers the infrastructure, determines a node’s configurations, as well as an application’s dependencies. (…) It is important to develop a repeatable method for querying this information.
Because, ideally, once we understand how to query reliably, the next step is to automate. (…) Automation is always helped by consistency. Try to standardize procedures where possible and where there is consensus. (…) A side note here: be sure to work with the engineering team on this one. Some of these queries can be very costly in a high-scale environment. (…) Think about adding a cache layer to the system; this could take pressure off of the other infrastructure. It can also make the overall system more robust if some external resource becomes unavailable.
By correlating the data from these various “management systems” it becomes possible to create a current “view of the world”. (…)
It becomes possible to answer questions such as: what are the IPs and hostnames for a given cluster of database nodes? Which containers are serving microservice_A? (…) What are the virtual IPs for a particular set of loadbalancers? Which datasources and dependencies is Application_B expected to require? (…)
With these answers it is also possible to build baseline, normal behaviour patterns.
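A toy version of this world-view baseline check might look like the following; the group names, IPs, and the expected app-to-database flow are all hypothetical:

```python
# Toy "view of the world" built from management-system data; all names and
# IPs below are made up for illustration.
world_view = {
    "db_cluster":  {"10.2.0.5", "10.2.0.6", "10.2.0.7"},
    "app_service": {"10.1.4.2", "10.1.4.3"},
}
# Expected flows derived from declared dependencies: the app talks to the db.
expected_flows = {("app_service", "db_cluster")}

def group_of(ip):
    """Map an IP back to its logical grouping, or None if unknown."""
    for name, members in world_view.items():
        if ip in members:
            return name
    return None

def is_anomalous(src_ip, dst_ip):
    """A flow is anomalous if its group-to-group pair is not in the baseline."""
    return (group_of(src_ip), group_of(dst_ip)) not in expected_flows

print(is_anomalous("10.1.4.2", "10.2.0.5"))  # False: matches the baseline
print(is_anomalous("10.2.0.5", "10.1.4.2"))  # True: db should not call the app
```

The important part is that world_view is rebuilt continuously from the management systems, since this mapping goes stale very quickly.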
This works pretty well: a custom query to the Elasticsearch data store based on the expected view of the world.
But this world doesn’t stay still. (…)
Due to the constant churn and redeployment of nodes and applications, these “views of the world” need to be rebuilt constantly. (…) Node information becomes stale very quickly. Mappings and views that were once associated with a certain type of traffic will change over time. (…) Most SIEMs cannot deal with this.
From experience, as these systems begin to be brought online, be prepared to deal with many false positives. How many depends on the amount of chaos currently in the infrastructure. (..) But with careful consideration these can be dealt with.
One way is better design of the data queries. (..) For example, there may be the need to incorporate whitelists when edge cases arise. (…)
Additional services could be built to further enrich the data set. It might be possible to verify a connection was created by a certain user or user group and therefore okay, even if it is outside the expected flow. (…)
The establishment of consistent policies and guidelines will help developers and operators configure their systems in a similar manner, and create a predictable and knowable pattern. (…)
And don’t forget blocking. Many times it is better to stop access altogether with technology such as firewalls or layer 2 segmentation.
So now we have this great system for anomaly detection on the network; what are we going to do when an anomaly is detected?
Not all anomalies are equal, so neither should the alert actions they produce be.
Some of the alerts should just result in the production of another log line.
Some might require an email.
Others might be a bit more important and sent via IM or irc.
As the severity increases, it might require SMS or pager services.
And some require the message to put on pants,(…)
Buy a bus pass, (…)
Ride to your house. (….)
And wake you up. (…)
In this system the alert results should be considered data as well. If a test was passed, is it because of a threshold count or a whitelist? If a test is passed because the count is below a certain threshold, how far below? (..) Enrich your alert logs as much as possible and feed that back into the system pipeline. It will then be possible to create escalation chains of alerts. You could also implement kill-chains. (..) Maybe you don’t want to get paged if an nmap scan was detected, or if someone logs into the VPN after midnight, or if a VPN login was from outside countries with offices. But if you see all three things within 10 minutes, then you may want a page. (…) Or something as simple as: every 4 email alerts generated in 10 minutes sends a page.
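That last rule, every 4 email alerts in 10 minutes sends a page, can be sketched with a simple sliding window. This is just an illustration of the idea, not the talk’s actual implementation:

```python
from collections import deque

# Example thresholds: 4 email-level alerts inside a 10-minute window pages.
WINDOW_SECONDS = 600
THRESHOLD = 4

class Escalator:
    def __init__(self):
        self.recent = deque()   # timestamps (seconds) of email-level alerts

    def on_email_alert(self, ts):
        """Record an email alert; return True when a page should fire."""
        self.recent.append(ts)
        # Drop alerts that have fallen out of the 10-minute window.
        while self.recent and ts - self.recent[0] > WINDOW_SECONDS:
            self.recent.popleft()
        return len(self.recent) >= THRESHOLD

esc = Escalator()
results = [esc.on_email_alert(t) for t in (0, 100, 200, 300)]
print(results)   # [False, False, False, True] - the fourth alert inside 10 min pages
```

The same pattern generalises to kill-chain style rules, where the window tracks several distinct alert types instead of a single count.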
An alert could trigger queries to external services. Depending on the answer received, the alert action could be changed.(…) For example, an ssh login from an unexpected geo location, could trigger a certain level of alert, but if a query to an external service could verify that the user was expected to be in that area, then a lesser action could be taken.
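A sketch of that enrichment step follows; the travel-lookup function is a hypothetical stand-in for whatever external system (HR calendar, travel booking tool) could answer the question:

```python
# Hypothetical external lookup: where is this user expected to be right now?
# In reality this would be an HTTP call to some internal service.
def expected_location(user):
    travel_records = {"alice": "JP"}   # made-up data for the example
    return travel_records.get(user, "DE")

def ssh_login_alert(user, geo):
    """Pick an alert action for an SSH login from an unexpected geo."""
    if geo == expected_location(user):
        return "log"    # user was expected there: low-severity action
    return "page"       # unexplained location: escalate

print(ssh_login_alert("alice", "JP"))  # "log" - the trip was on record
print(ssh_login_alert("bob", "JP"))    # "page" - bob was not expected in JP
```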
All these alerts and actions sound great but…….
I think I should stop and say something about alert fatigue. It is a very real problem. If a system emits too many alerts, operators can become swamped and overrun. (…) Or if the system incorrectly classifies the severity of an anomaly and operators get paged for trivial matters, then fatigue will set in, and it becomes all too easy to become complacent to the noise. (…) Invest time and resources in getting this part right. Saving engineering time and stress saves the organization money in the long run.
By now you can tell that this system is getting somewhat complex. As rules, alerts, actions, and external integrations are added, this complexity will only increase. It is important to think about human-to-system interaction. After all, one of the keys to success is getting as many people as possible using the system. They will only do this if they can reliably interact with it.
I have already spoken about how great Kibana is as an investigation tool. It also has the ability to load external JSON configuration files, which makes automatic scripting of dashboards possible, as you can see here. (…) I would hate to have to manually type all these IPs into a web interface.
But it would be wise to consider other user interfaces with which to manage this system. It is true that, as security people, web design and user interfaces are not our first priority; however, it is surprising how easy it can be with some of the modern frameworks. From personal experience, I was able to mock up something very reasonable in a short amount of time using the Sinatra framework.
Some of the capabilities you might want to include in a UI are:
1: Kibana dashboard generation
2: Alert creation, editing and deletion, snooze and whitelist capabilities, grouping, sorting, searching, display, and export
3: History of world views
4: General infrastructure query tools
5: Usability helpers, docs, and links
So you have built this pretty intricate and complex interconnected group of systems. And you should be proud. But is it working as intended? (…) Is it robust and secure? (…) Arguably, this is one of the most important questions you could ask. And many times it is overlooked.
Consider creating tests that go beyond the build pipeline. (…) Create a set of small, SAFE (and I can’t stress that word enough: SAFE) red-teaming apps, bots, and scenarios to test the alert sets and time to discovery. (…) Create helper scripts that prune the alert sets themselves and look for unnecessary allow statements. This can help prevent stale alerts and privilege creep. (…)
In thinking about a robust system, consider how the system design would react to the loss of a data node, or the loss of connections to an external service or data store. (…) What if the node cannot connect to the internet? Can it rebuild primary systems without it?
So now there is this system up and running, hopefully delivering meaningful alerts with few false positives. Slowly but surely you will start to expand your range of visibility and coverage within the infrastructure and even user machines. (…) There are really endless possibilities in the modern network. But there is also this wonderful treasure trove of data at your fingertips. And there really is so much that you can do with it.
If you are lucky like me, maybe you have some people on your team who understand all this math, statistical analysis, and machine learning stuff. (..) You can then begin to understand the data in new ways and detect anomalies you never could before. Everything we have been talking about today is very targeted at protecting a known item, with known rules. (…) Machine learning can help you look for anomalous activity over the entire network. (…) Literally find things you were not looking for.
Even though the primary function of this network monitoring system is security, it may be able to help other teams in your organization. It can help track down deployment problems, configuration errors, usage, or any number of other issues that show up in the network traffic.
This system can also be crucial when it comes to an external audit or inquiry. (…) Because you have the network traffic, DNS, and other service logs, you can prove or disprove assumptions. (..) In any digital law investigation I’ve ever been part of, the first question is “Where are the logs?”. That’s what auditors want to know as well.
Revealed by a detailed understanding of the organization
Find or create the relevant data
Tools… Elasticsearch for indexing logs
What is an anomaly?
Build a world view, correlate logs, and test it.
Am I correlating my alerts?
Kill-chains? Escalation chains
What about User Experience?
Easier for people the better
Is it robust and secure?
Yes?
What else can I do with this data?
So, I’m coming to the conclusion of this talk. I hope that this talk has been useful to you. (…)
And I wanted to leave you with a few words. I know defence can be lonely. It doesn’t get all the glamour of offence. But remember, at the end of the day, you are the one everyone depends on.
This is you as long as you take up the challenge. You are the game master because you wrote the rules. You understand the technology. It is your playground. (…)
The only thing that can fail you is the hardware. And we can always get more hardware.