
Ancient “AI” in the Age of Advanced Adversaries


A lot is being said about the use of AI in Cyber Security, and for good reason – folks in information security (as we have called "cyber security" for decades now) have experienced the problem first-hand. It's only natural that already stretched InfoSec teams look to AI as the "saviour" that will close the skills and personnel gap. Then again, a lot is also being said about companies selling products as "AI enabled".

But realistically speaking, are there things traditional ("non-AI") organizations can do to achieve what many of these "AI enabled" products do? I wouldn't have written this blog post, now would I, if the answer was anything but yes! 🙂

Let’s look at them:

  1. Anomaly Detection – this is age old! Almost all security tools that "alert" us on something are essentially using this. How well? That's debatable. The kind of anomaly detection I am talking about is simple (but different). For example, abnormal login attempts on your Internet-facing systems are an anomaly. So are abnormal DNS queries. Your CloudTrail logs (in AWS) showing an inordinate spend on EC2 instances are an anomaly. An abnormally small amount of time between a git commit and the production deployment of that commit is also odd! Your SaaS or Okta bill being unexpectedly high, or your APIs getting throttled (without any known changes), are all anomalies. The response time for these depends on whether or not you have been able to automate their detection. The day you automate these "known" anomalies you are already doing what many of these "AI enabled" products are doing today (after, of course, charging you an arm and a leg!)
  2. UBA / User Behavior Analytics – a lot of products do this, but the most simplistic version is reducing logins / preventing logons from areas where you do not expect your users to originate. This is "reduction" of attack surface. Is that foolproof? Hell no! Why? Generally speaking, adversaries do not attack systems from their home computers. Adversaries operate by using trampoline servers (sometimes layers of them) so the attack arrives from "attacker controlled bots". But it reduces your area of concentration. And then you can use UBA more effectively, since you do know at a macro level where your users are expected to originate from. To improve your "AI-ness" you can then add capabilities that work not at a macro level but at a per-user level: where is that specific user expected to originate from? If a login looks abnormal (or anomalous), ask them to step up. There are numerous vendors in this space, as well as open source libraries that can help you do this on the cheap (see the sketch after this list). Again, something the very expensive "AI enabled" products do too.
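To make the per-user idea concrete, here is a minimal sketch of that kind of origin check. It assumes you already have an IP-to-country lookup and a step-up mechanism (the `lookup_country` and `require_step_up` names below are hypothetical placeholders, not any specific product's API): keep a baseline of countries each user normally logs in from, and require step-up authentication when a login arrives from somewhere new.

```python
from collections import defaultdict

# user -> set of countries observed during a learning window
known_countries = defaultdict(set)

def record_login(user: str, country: str) -> None:
    """Fold a successful, verified login into the user's baseline."""
    known_countries[user].add(country)

def login_is_anomalous(user: str, country: str) -> bool:
    """Flag a login from a country the user has never been seen in before."""
    baseline = known_countries[user]
    return bool(baseline) and country not in baseline

# On an anomalous login, don't block outright; trigger step-up (e.g., an MFA re-prompt):
# if login_is_anomalous("alice", lookup_country(source_ip)):
#     require_step_up("alice")
```

The point is not the ten lines of Python; it is that the "per-user expected origin" signal is cheap to compute and automate once you have login telemetry flowing somewhere queryable.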

I am sure there are many other things an organization can start doing. Obviously, at the end of the day, every initiative takes resources and by no means are any of these simple, but YMMV depending on the size of your datasets, users, and organization.


Machine Learning Security in the age of Supply Chain Attacks


As can be seen from the recent "xz attack" discovery, nation states have realized that this is likely the "best" vector to impact large-scale systems in big organizations. With cloud computing providers being the "source of computing" for most large corporations today, we should anticipate that a larger portion of attacks will fall into this category. Also, just like "sleeper cells" in traditional espionage, such "sleepers" may exist in numerous OSS projects. Does that mean we should stop using open source – hell no. It just means we need to be careful. Can we detect these attacks? It's tough, but yes, we can detect them with good old-fashioned telemetry and observability.

But that's not what this blog post is about. The most interesting bit of the xz attack, for me, was that libraries that are harder to debug and decode are much juicier targets. Why does that matter? Popular ML libraries like PyTorch and TensorFlow are quite hard to compile from scratch on your own. Such libraries carry interesting attack vectors, including "nice" pickle-based compromises. I say "nice" because the family of insecure deserialization weaknesses has existed in CWE since 2006! It's older than many other issues and will continue to exist.
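As a small illustration of the pickle angle, here is a minimal sketch of loading a model checkpoint more defensively. It assumes PyTorch 1.13 or newer, where torch.load accepts a weights_only flag; treat this as a mitigation for untrusted files, not a guarantee against a compromised library itself.

```python
import torch

# UNSAFE: torch.load on an untrusted file unpickles arbitrary objects, so a crafted
# checkpoint can execute code at load time (classic insecure deserialization).
# state = torch.load("model_from_the_internet.pt")

# Safer: restrict unpickling to plain tensors and containers.
state = torch.load("model_from_the_internet.pt", map_location="cpu", weights_only=True)
```

Formats that avoid pickle entirely for weights are another hedge, but none of that helps if the build of the library you installed was itself tampered with – which is exactly why reproducible builds matter.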

My only hope is that maintainers of core ML projects such as PyTorch, TensorFlow, Keras and others start showing a slightly higher level of paranoia and invest in build reproducibility, so that supply chain attacks against such hard-to-debug libraries can be avoided.


Security Considerations in Blue-Green Deployments


tl;dr: Blue-green deployments are a strong strategy for applications with critical uptime requirements, but if a deployment fixes critical security issues, be sure that the definition of "deployment complete" is decommissioning of the "blue" environment, not just a successful deployment of "green".

Organizations have gotten used to following Continuous Integration/Continuous Deployment (CI/CD) for software releases. The use of cloud solutions such as the AWS Code* utilities, Azure DevOps or Google Cloud Source Repositories enables enterprises to quickly and securely accomplish CI/CD for software release cycles. Software upgrades go through the same CI/CD tooling, and the dev teams need to choose how to do the upgrades.

There are a few different ways development teams upgrade their applications: full cut-over (same infrastructure, new codebase deployed directly and migrated in one go), rolling deployments (same or new infra, gradually upgrading all instances), immutable (brand new infra and code for each deployment, migrated in one go), and blue-green deployment (a parallel stack deployed alongside the old one in production, gradually phasing out old instances upon successful tests on a slice of traffic). The blue-green strategy has, therefore, become quite popular for modern software deployment.

What is a Blue-Green deployment?
When you release an application via the blue-green deployment strategy, you gradually shift traffic as tests succeed and your observability (CloudWatch alarms, etc.) does not indicate any problems. You can do that via containers (ECS/EKS/AKS/GKE) or AWS Lambda/Azure Functions/Google Cloud Functions, and traffic shifting can be done with the help of DNS solutions (Route 53/Azure DNS/Google Cloud DNS) or load balancing solutions (AWS Elastic Load Balancer/Azure Load Balancer/Cloud Load Balancing). Simplistically: take your current (blue) deployment, create a full parallel stack (green), and use either DNS or load balancers to slice out a section of traffic to test the "green" stack. This is all happening in production, by the way. Once everything looks good, direct all traffic to green and decommission the "blue". This helps maintain operational resilience, which is why it is such a popular deployment strategy. AWS has a solid whitepaper I recommend reviewing if you want to dive in from a solution architecture standpoint.
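For the load balancer flavour of this, here is a minimal sketch of weighted traffic shifting using boto3 against an Application Load Balancer listener. The ARNs are hypothetical placeholders; the idea is simply that "shift 10% to green" is one API call you can drive from your deployment pipeline.

```python
import boto3

elbv2 = boto3.client("elbv2")

def shift_traffic(listener_arn: str, blue_tg: str, green_tg: str, green_weight: int) -> None:
    """Send `green_weight` percent of traffic to the green target group, the rest to blue."""
    elbv2.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[{
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": blue_tg, "Weight": 100 - green_weight},
                    {"TargetGroupArn": green_tg, "Weight": green_weight},
                ]
            },
        }],
    )

# Example: start with a 10% canary slice on green, then move to 100% once tests pass.
# shift_traffic(LISTENER_ARN, BLUE_TG_ARN, GREEN_TG_ARN, green_weight=10)
```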

Security Considerations

Some critical security issues (e.g., remote code execution via Log4j, remote code execution via Struts, etc.) demand immediate fixes because of their severity. If your blue-green deployments take days and your tests run over a very long period (say, days), then any security fix you make is only fully in effect after a successful "green" deployment and the "blue" decommission. If, during that window or before it, an attacker managed to get a foothold in the impacted "blue" environment, then decommissioning the "blue" becomes critical before you can claim the issue is fully remediated. Typically, incident responders and security operations professionals breathe a sigh of relief when the fix is deployed. Typically, software engineering teams consider the fix deployed when "green" is handling all traffic (which is unrelated to the decommissioning of the "blue" environment). Incident responders need to remember that the risk is not truly mitigated at deployment time; it is mitigated once the cut-over to green is complete and blue has been decommissioned. There is a subtle, yet important, difference here, and it really comes down to shared vocabulary. As long as security operations and software development teams share the same definition of what "deployed" means, there are no misunderstandings.
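One way to make that shared definition enforceable is to check it rather than assume it. Below is a minimal sketch (same hypothetical ARNs as before) of a verification you could run before closing the remediation ticket: blue must carry zero listener weight and have no registered targets left.

```python
import boto3

elbv2 = boto3.client("elbv2")

LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/example/..."    # hypothetical
BLUE_TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/blue/..."         # hypothetical

def blue_is_decommissioned() -> bool:
    # 1. Blue must carry zero traffic weight on the listener.
    listener = elbv2.describe_listeners(ListenerArns=[LISTENER_ARN])["Listeners"][0]
    for action in listener["DefaultActions"]:
        for tg in action.get("ForwardConfig", {}).get("TargetGroups", []):
            if tg["TargetGroupArn"] == BLUE_TG_ARN and tg.get("Weight", 0) > 0:
                return False
    # 2. Blue must have no registered targets left behind.
    health = elbv2.describe_target_health(TargetGroupArn=BLUE_TG_ARN)
    return len(health["TargetHealthDescriptions"]) == 0

if __name__ == "__main__":
    print("Safe to declare the issue remediated:", blue_is_decommissioned())
```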


Security Considerations in use of AI/ML


The world of Artificial Intelligence (AI) seems to be exploding with the release of ChatGPT. But as soon as the chatbot came into the hands of the public, people started finding self-sabotaging queries at worst (exploitable issues), along with some weird interactions whereby people could write malware designed to evade Endpoint Detection and Response (EDR) tooling.

What is AI?

Very simply, Artificial Intelligence (per Wikipedia) is intelligence demonstrated by machines. More technically, it is a set of algorithms that can do things a human does by making an inference, similar to humans, on the basis of data that was historically provided as a "reference" for making decisions. This reference data is called training data. The data used to test how effectively the algorithm arrives at decisions on the basis of that reference is called test data. Any good machine learning course teaches how to design your data, how much to use for training, how much to use for testing, and which performance metrics to track, but that is not relevant to our discussion here. What is important is that it is the data you provide that controls the decision-making in an artificially intelligent algorithm. This is a key difference from typical algorithms (where the code is more or less static and makes decisions based on certain states in the program): in an artificially intelligent system, the program can arrive at different decisions depending on how one decides to "train" the algorithms.

What is ML?

Machine Learning (ML) is a subset of Artificial Intelligence (AI) in which the algorithms evolve their decision-making on the basis of data that has been processed and tagged as training data. ML systems have long been used for classifying spam and for anomaly detection in computer security. These systems tend to use statistical inference to establish a baseline and highlight situations where the input data does not fall within the norm. When operational data is used to train an ML-based system, one has to be careful not to incrementally alter the baseline of what's normal and what's not. Such "tilting" may happen over time, and it's important to protect against drift in these systems. Some drift is OK but "bad drift" is not, and the difference is hard to predict. For example, if you classify some data inaccurately and accidentally (or maliciously) end up using it to train your ML models, and it inherently alters the behavior of the model, then the model becomes unreliable.
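A minimal sketch of one way to watch for that kind of shift, assuming tabular feature arrays and an arbitrary three-standard-deviation threshold (both are illustrative choices, not a standard): compare the mean of each feature in recent operational data against the training baseline before folding the new data back into training.

```python
import numpy as np

def drifted_features(train: np.ndarray, recent: np.ndarray, n_std: float = 3.0) -> list[int]:
    """Return indices of features whose recent mean has shifted more than n_std baseline
    standard deviations away from the training mean."""
    baseline_mean = train.mean(axis=0)
    baseline_std = train.std(axis=0) + 1e-9  # avoid division by zero
    shift = np.abs(recent.mean(axis=0) - baseline_mean) / baseline_std
    return [i for i, s in enumerate(shift) if s > n_std]

# Usage: run periodically; investigate flagged features before retraining on the new data.
# suspicious = drifted_features(training_features, last_week_features)
```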

What is Adversarial AI?

Adversarial Artificial Intelligence (AI) threats are ones where malicious actors design inputs to make models predict erroneously. There are a couple of different types of attack here: a poisoning attack (where the model is trained with bad data controlled by adversaries) and an evasion attack (where the artificial intelligence system is made to draw a bad inference with a security implication). One way to understand these attacks: the poisoning attack is basically "garbage in, garbage out", but with really "special" garbage – garbage that changes the behavior of the algorithm so that it returns an incorrect result when it has to make a decision. Evasion attacks are different in that the decision is wrong because the input appears differently to the ML algorithm than it does to humans. E.g., Gaussian noise being classified as a human, or a fingerprint being matched incorrectly.
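To make the evasion side concrete, here is a minimal sketch of the well-known fast gradient sign method (FGSM) in PyTorch. The `model`, `loss_fn`, inputs and the epsilon value are all assumed placeholders; the point is only to show how little it takes to nudge an input so that a human sees the same thing but the model's prediction flips.

```python
import torch

def fgsm_example(model, loss_fn, x, y, epsilon=0.01):
    """Return an adversarially perturbed copy of x using the FGSM technique."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    # Perturb each input element by +/- epsilon along the sign of the loss gradient.
    return (x + epsilon * x.grad.sign()).detach()
```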

Can we attack these systems in other ways?

In a paper presented by Google, researchers created a tool (TensorFuzz) with which they were able to demonstrate finding a few varieties of bugs in Deep Neural Networks (DNNs). So typical software attack techniques work against deep neural networks too. Fuzzing has been used for decades and has been causing faults in code forever. At its core, fuzzing is simple: send garbage input that causes a failure in the program. It's just that failures in a DNN look different, and you want to ensure that the software relying on the DNN to make a decision handles such failures with secure defaults rather than turning them into a security failure.
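In that spirit, here is a minimal sketch of a crude numerical fuzz check, loosely inspired by the kind of faults TensorFuzz surfaces (it is not the TensorFuzz algorithm): feed exaggerated random inputs to an assumed PyTorch `model` and treat NaN/Inf outputs as failures the calling software must handle safely.

```python
import torch

def fuzz_for_numerical_faults(model, input_shape, trials=1000):
    """Return the trial indices whose random inputs produced non-finite model outputs."""
    failures = []
    with torch.no_grad():
        for i in range(trials):
            x = torch.randn(1, *input_shape) * 100  # exaggerated magnitudes as "garbage" input
            out = model(x)
            if not torch.isfinite(out).all():
                failures.append(i)
    return failures
```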

Protection mechanisms

There are a few simple ways to look at ML systems and their security. Microsoft released an excellent how-to on threat modeling ML systems. Additionally, using adversarial training data is imperative to ensure that artificially intelligent algorithms perform as you expect in the presence of adversarial inputs. When you rely on ML-based systems, it's all the more important that you test them appropriately and keep testing them against baselines. Unfortunately, transparency of decision-making continues to be an issue for Deep Neural Networks and needs AI/ML researchers to establish appropriate transparency measures.


What to do when things go wrong?


I blogged earlier about blameless post-mortems and how one gets to the point of being able to do them – by having operational rigor and observability. This is more of a lessons-learnt post about what you do, and what you don't do, when things go wrong.

Focusing on the Who?

A lot of times, focusing on "who reported the issue?" is focusing on the wrong thing. Whether a report comes from a penetration test, a bug bounty researcher, or an internal security engineering resource, you need to make sure that the impact and likelihood are clearly understood. There are times when customers (who pay, or intend to pay, for your service) report problems – those are obviously more important.

Focusing on the How?

How a security issue gets reported is important. You might learn about a security issue via a bug report, via your own telemetry, or on Twitter! There are potential legal ramifications in each of these cases, and the risks may differ. When things become public without your knowledge and without notification, you have a role to play in instilling confidence in your current customers. The best approach here tends to be sticking to facts without any speculation. If you are working on an incident, say so. Don't say "we are most secure" when you are the subject of a breach discussion, especially because you already have data showing you are not as secure as claimed. Identification and containment of the security issue are top priorities – do not pull the resources doing that work away to make Public Relations look good; doing that will eventually make public relations bad! Involve lawyers in your communications process and mark communications with the right legal tags ("attorney-client privileged material") so that if litigation happens you can clearly demarcate which evidence can and cannot be part of discovery.

Focusing on the What?

"What" needs to be done has to be made clear with the help of an incident manager. The incident manager is the person who is most well-read, is a subject matter expert, and leads the response process. Having this single-threaded ownership of the incident is incredibly important. The role of the incident manager is to ensure they have all the information they need to make decisions. This also streamlines public relations, legal needs, and incident cleanup (eradication and recovery), and helps with swift, focused decision-making. Sometimes this is crisis management, depending on impact; otherwise it is just another day in the security operations office. The key traits here are focus and goal-based decision-making. Adrenaline can run high and tempers can flare – that typically happens when you are unprepared to handle security incidents. The tempers and nervousness can be avoided by being proactive: tabletop exercises, incident dry-runs, and good runbooks. But all practice games do is prepare you for the real thing – the real thing is how you handle a true incident. Use the help of key stakeholders to arrive at the best decisions. There are often situations where no answer looks good, and that is where customer focus comes in: if you focus on the well-being of customers, you will rarely go wrong.

Focusing on the Why?

Capture incident response logs in tickets and communications so the timeline and actions are properly documented. After recovery is complete, do a blameless post-mortem of how you got there. Commit to the agreed-upon corrective actions on an agreed timeline and don't waver – this is part of the operational rigor one needs in order to keep incidents from happening in the future. Typically, issues happen because something was not prioritized as it should have been. Reprioritize so you can reassess. Sometimes the size of the incident makes your reprioritization feel almost coerced – it's OK to be coerced in that direction. You will find that the coercion is simply an acceleration of actions you should have taken up earlier. No one is perfect – just come out of it better!

Focusing on the Where?

Where you discuss the issue is important. When sizable incidents happen, discuss them openly with business leaders so that full awareness and feedback are provided in "powerful forums". This obviously does not mean you break attorney-client privilege; it means discussing action items, impact, and post-mortem results with the highest levels of leadership. This enables the business to become resilient over time and to develop confidence in the security teams. If you need to make public releases, ensure that lawyers, security SMEs, and business leaders have all read them – only then publish. Don't let a "left hand doesn't know what the right hand is doing" situation ever occur. This instills customer confidence in your process.

Conclusion

This was just an attempt to pen down my thoughts as they appeared in my brain. I am sure I forgot a lot, such as handling stress, avoiding knee-jerk reactions, etc., but these are the most important things I felt were necessary to share. Remember, incident handling gets better with practice – you want that practice to happen in practice games, not in the Olympics! 🙂