Thursday, August 9, 2018

BH18: AI & ML in Cyber Security - Why Algorithms are Dangerous

Raffael Marty, VP Corporate Strategy, Forcepoint

We don't truly have AI yet. Algorithms are getting smarter, but experts are more important. Understand your data and algorithms before you do anything with them. It's important to invest in experts who know security.

Raffael has been doing this (working in security) for a very long time, and then moved into big data. At Forcepoint, he's focusing on studying user behavior so that they can recognize when something bad is happening. ("The Human Point System")

Machine learning is an algorithmic way to describe data. In the supervised case, we give the system a lot of training data. In the unsupervised case, we give the system an optimization problem to solve. "Deep learning" is a newer machine learning algorithm that eliminates the feature engineering step. Data mining is a set of methods to explore data automatically. And AI is "a program that doesn't simply classify or compute model parameters, but comes up with novel knowledge that a security analyst finds insightful" (we're not there yet).

Computers are now better than people at playing chess and Go, and they are getting better at designing effective drugs and at making things like Siri smarter.

Machine learning is used in security for things like detecting malware, detecting spam, and finding pockets of bad IP addresses on the Internet in supervised cases, and more in unsupervised ones.

There are several examples of AI failures in the field, like the Pentagon training AI to recognize tanks (they used sunny pictures for "no tank" and cloudy pictures for "tank", so the AI system assumed there were no tanks in sunny weather... oops!)

Algorithms make assumptions about the data: they assume the data is clean (it often is not), make assumptions about the distribution of the data, and don't deal with outliers.  The algorithms are too easy to use today - the process is more important than the algorithm.  Algorithms do not take domain knowledge into account, for example when defining meaningful and representative distance functions.  Ports look like integers, and algorithms make bad assumptions about the "distance" between them.

There is bias in the algorithms that we are not aware of (for example, translating "He is a nurse. She is a doctor." from English to Hungarian and back again... Hungarian pronouns are gender-neutral, so the translation back has to guess, and suddenly the genders are swapped! Now she is a nurse....)

Too often assumptions are made based on a single customer's data, or learning from an infected data set, or simply missing data.  Another example is an IDS that got confused by IKE traffic and classified it as a "UDP Bomb".

There are dangers in using deep learning. Do not use it if there is not enough quality labelled data, or none at all, and look out for things like time zones. You need well-trained domain experts and data scientists to oversee the implementation and to understand what was actually learned.
Note - there are not a lot of individuals who understand both security and data science, so make sure you build a good, strong and cohesive team.

You need to look out for adversarial input - you can add a small amount of noise to an image, for example, that a human cannot see, but can trick a computer into thinking a picture of a panda is really a gibbon.
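
The panda example from the talk involved an image classifier. The idea can be sketched on something much smaller: a toy linear classifier, where nudging every feature by a tiny amount in the direction of the sign of the weights (a fast-gradient-sign style step) is enough to flip the decision. The weights, bias, input and epsilon below are made-up values for illustration, not anything shown in the talk.

```python
def score(weights, bias, x):
    """Linear score; positive means class A, negative means class B."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def perturb(weights, x, epsilon):
    """Shift each feature by at most epsilon in the direction that lowers the score.

    For a linear model the gradient of the score with respect to x is just
    the weight vector, so the sign of the gradient is the sign of each weight.
    """
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]

# Hypothetical model and input (assumptions for illustration only).
weights = [0.5, -0.25, 0.75, -0.5]
bias = -0.1
x = [0.4, 0.2, 0.3, 0.5]

adversarial = perturb(weights, x, epsilon=0.05)
print("original score:   ", score(weights, bias, x))
print("adversarial score:", score(weights, bias, adversarial))
print("largest change to any feature:",
      max(abs(a - b) for a, b in zip(adversarial, x)))
```

No single feature moves by more than epsilon, yet the small changes add up across all features. Real images have many thousands of features, so the per-pixel change needed can be far too small for a person to notice.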

Deep learning - is it the solution to everything? Most security problems cannot be solved with deep learning (or supervised methods in general). We looked at a network graph - we might have lots of data, but not enough information, context or labels - the dataset is actually no good.

Can unsupervised methods save us?  Can we exploit the inherent structure within the data to find anomalies and attacks?  First we have to clean the data, engineer distance functions, analyze the data, etc...

In one graphic, a destination port was misclassified as a source port (80!), and one bit of data had port 70000, which is above the maximum valid port of 65535!  While it's obvious to those of us with network knowledge that the data is messed up, it's not obvious to the data scientists who looked at the data. (With this network data, the data scientists found "attacks" at port 0.)
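
Checks like these are the sort of domain knowledge that can be written down before any algorithm sees the data. A minimal sketch, assuming flow records are dictionaries with hypothetical `src_port` and `dst_port` fields (the field names and the "direction" heuristic are my assumptions, not from the talk):

```python
MAX_PORT = 65535  # ports are 16-bit values

def check_port(value):
    """Return a list of problems found with a single port value."""
    problems = []
    if not isinstance(value, int):
        problems.append("not an integer")
    elif value < 0 or value > MAX_PORT:
        problems.append("outside 0-65535")
    elif value == 0:
        problems.append("port 0 is reserved; likely a parsing or collection artifact")
    return problems

def check_flow(flow):
    """Collect problems for the source and destination port of one flow."""
    issues = {}
    for field in ("src_port", "dst_port"):
        found = check_port(flow.get(field))
        if found:
            issues[field] = found
    # Heuristic (an assumption): a well-known source port talking to a high
    # destination port is either a reply or a record with swapped fields.
    if not issues and flow["src_port"] < 1024 <= flow["dst_port"]:
        issues["direction"] = ["well-known source port; swapped fields or a reply?"]
    return issues

flows = [
    {"src_port": 51514, "dst_port": 443},
    {"src_port": 70000, "dst_port": 80},
    {"src_port": 0, "dst_port": 53},
    {"src_port": 80, "dst_port": 49152},
]
for flow in flows:
    print(flow, check_flow(flow))
```

Flagged records are worth a look by someone who knows the network before they are fed into clustering or anomaly detection.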

Data science might classify port 443 as an "outlier" because it's "far" from port 80 - but to those of us who know, they are not "far" from each other technically.
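
To make the point about "distance" concrete, here is a small sketch comparing the naive numeric distance with one that knows which ports belong to the same kind of service. The service grouping and the 0 / 0.5 / 1 scale are illustrative assumptions, not a standard.

```python
# Hypothetical grouping of ports by service (an assumption for illustration).
SERVICE_GROUP = {
    80: "web", 443: "web", 8080: "web",
    25: "mail", 587: "mail",
    53: "dns",
}

def numeric_distance(a, b):
    """Treats ports as ordinary integers, which is what a naive algorithm does."""
    return abs(a - b)

def service_distance(a, b):
    """0 for the same port, 0.5 for the same service group, 1 otherwise."""
    if a == b:
        return 0.0
    group_a = SERVICE_GROUP.get(a)
    group_b = SERVICE_GROUP.get(b)
    if group_a is not None and group_a == group_b:
        return 0.5
    return 1.0

for a, b in [(80, 443), (80, 81), (25, 587)]:
    print(a, b, numeric_distance(a, b), service_distance(a, b))
```

Numerically, 80 and 81 are neighbors and 80 and 443 are far apart. With the service-aware distance it is the other way around, which is closer to how a network person thinks about them.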

Different algorithms struggle with clustered data and with the shape of the data.  Even if you choose the "right" algorithm, you must understand its parameters.

If you get all of those things right, then you still need to interpret the data. Are the clusters good or bad? What is anomalous?

There is another approach - probabilistic inference. Look at Bayesian belief networks. The first step is to build the graph, thinking about the objective and the observable behaviors. If the data is too complicated, you may need to introduce "grouping nodes" and the dependencies between the groups. After all the right steps, you still need to get expert opinions.
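
As a rough illustration of what such a network computes, here is a tiny sketch with one hidden objective node ("compromised") and two observable behaviors that depend on it. The node names and every probability are made-up placeholders of the kind an expert would have to supply. The inference is plain enumeration over the two states of the hidden node, not the output of any particular library.

```python
# All numbers are hypothetical placeholders (assumptions for illustration).
P_COMPROMISED = 0.01

# P(observation is True | compromised state)
P_OBS = {
    "odd_login_hour": {True: 0.6, False: 0.1},
    "large_upload": {True: 0.5, False: 0.05},
}

def posterior_compromised(observations):
    """P(compromised | observations), enumerating both states of the hidden node."""
    joint = {}
    for state, prior in ((True, P_COMPROMISED), (False, 1 - P_COMPROMISED)):
        p = prior
        for name, seen in observations.items():
            p_true = P_OBS[name][state]
            p *= p_true if seen else 1 - p_true
        joint[state] = p
    return joint[True] / (joint[True] + joint[False])

print(posterior_compromised({}))
print(posterior_compromised({"odd_login_hour": True}))
print(posterior_compromised({"odd_login_hour": True, "large_upload": True}))
```

Each extra piece of evidence moves the belief, and how far it moves depends entirely on the conditional probabilities, which is where the expert opinions come in.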

Make sure you start by defining your use cases, not by choosing an algorithm. ML is barely ever the solution to your problem. Use ensembles of algorithms and teach the algos to ask for input!  You want them to have expert input and not make assumptions!
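
One simple way to picture "ask for input" is an ensemble that only decides on its own when its members mostly agree, and otherwise hands the case to a person. This is a minimal sketch; the function name, the labels and the 0.75 agreement threshold are hypothetical choices, not something prescribed in the talk.

```python
from collections import Counter

def ensemble_decision(votes, min_agreement=0.75):
    """Return the majority label, or 'ask_analyst' when detectors disagree too much."""
    if not votes:
        return "ask_analyst"
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    return "ask_analyst"

# Votes from four hypothetical detectors looking at the same event.
print(ensemble_decision(["malicious", "malicious", "malicious", "benign"]))
print(ensemble_decision(["malicious", "benign", "benign", "malicious"]))
```

The analyst's answers on the escalated cases can then be kept as labelled examples, so the expert input feeds back into the system instead of being guessed at.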

Remember - "History is not a predictor, but knowledge is."