Artificial intelligence has given operational security a new ally. Véronique Legrand, a researcher in the field, explains how machine learning can help protect information systems against increasingly targeted and ingenious attacks.
"The current challenges in operational security are simple: we are faced with a deluge of data and a huge variety of data to process. To use an analogy, on just one of the information systems supervised by Intrinsec, it's as if security analysts had to process more than 11,000 books written in more than 10 different languages every day", explains Véronique Legrand, a researcher in Security and Artificial Intelligence at the Insa in Lyon and head of Innovation and Research for Intrinsec.
This data inflation, the famous big data, goes hand in hand with new channels: social networks, websites and connected objects, each of which can serve as a vector for an attack on the company.
These attacks are becoming increasingly dangerous, carried out on a massive scale by malicious and particularly inventive individuals. Security analysts no longer have the time to acquire the knowledge they need to respond effectively, particularly as that knowledge covers both the technologies under surveillance (cloud, connected objects) and their vulnerabilities, not forgetting the analysis of adversaries' modus operandi, which increasingly exploits a psychological dimension (phishing, etc.).
To counter these intrusions, security analysts are constantly applying complex and risky rules, when they should instead be able to learn from their adversaries' exploits, analyse them, and then adapt appropriate countermeasures.
Faced with these new threats, humans alone are no longer enough, and as in other sectors, artificial intelligence is proving invaluable in supporting analysts.
This science is booming thanks to the power of servers capable of processing ever larger volumes of data, but that is not all: "What characterises machine learning today is the increased involvement of business experts and scientists to feed the artificial intelligence algorithms. This combination of new knowledge and computing power creates a favourable environment for the development of AI in security."
Data makes you intelligent
In practice, operational security analyses every trace left by the various systems, and each of these traces can give rise to a security alert, even though the traces are highly heterogeneous. One of the roles assigned to AI is to automate human activities and organise this information so that it is useful and accessible.
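To make this concrete, here is a minimal sketch of what normalising heterogeneous traces into a common schema can look like. This is not Intrinsec's actual pipeline: the two log dialects, the regular expressions and the field names are assumptions chosen purely for illustration.

```python
import re

# Hypothetical patterns for two log dialects; a real SOC pipeline
# handles many more formats (syslog, JSON, CEF, ...).
PATTERNS = {
    "apache": re.compile(r'(?P<src_ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})'),
    "ssh": re.compile(r'(?P<ts>\w{3} +\d+ [\d:]+) \S+ sshd\[\d+\]: (?P<event>Failed|Accepted) password for (?P<user>\S+) from (?P<src_ip>\S+)'),
}

def normalise(line: str) -> dict | None:
    """Map one raw trace line onto a shared schema, whatever its source."""
    for source, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return {"source": source, **match.groupdict()}
    return None  # unknown dialect: route to a human or a fallback parser

print(normalise('203.0.113.7 - - [10/Oct/2023:13:55:36 +0200] "GET /admin HTTP/1.1" 403'))
print(normalise('Oct 10 13:55:40 host sshd[1234]: Failed password for root from 203.0.113.7'))
```

Once every trace carries the same field names, downstream algorithms can treat an Apache hit and an SSH login attempt as comparable events.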
"The challenge in security is to standardise the data collected and give it meaning to facilitate the analyst's decision. In deep learning, we use thousands of pieces of data to 'feed' the algorithms and make them more intelligent. The advantage is that, in short, the algorithms are capable of recognising and characterising classes of data from the moment they are given examples of 'what to do' or 'what not to do'. This principle can be implemented after following a "training phase", which consists of "feeding" the algorithm with several thousand data items labelled as classes. As Google says, an algorithm is useless if you don't 'feed' it data to model it.
Ontology has class
"Ontologies are useful models for guiding algorithms in their learning. They are organised into classes which, combined, give meaning to the elements belonging to them. For example, a sentence in French has three classes: subject, verb and complement. Each element is labelled as such, which helps establish the role of each word in the sentence and, by the same token, the meaning of the sentence as a whole. Learning algorithms use statistics to evaluate the frequency of classes. In our example, the French word 'il' ('he') will very often be labelled with the 'subject' class by the language expert. When 'il' comes up for translation, the algorithm will assign it that class."
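A deliberately simplified sketch of that frequency principle (real part-of-speech tagging is far more sophisticated): tally the experts' labels per word, then assign new occurrences the most frequent class.

```python
from collections import Counter, defaultdict

# Expert-labelled examples: (word, class) pairs drawn from the
# subject / verb / complement ontology described above.
labelled = [("he", "subject"), ("he", "subject"), ("reads", "verb"),
            ("he", "complement"), ("books", "complement")]

counts = defaultdict(Counter)
for word, cls in labelled:
    counts[word][cls] += 1

def predict_class(word: str) -> str:
    """Assign the class most frequently chosen by the experts."""
    if word not in counts:
        return "unknown"
    return counts[word].most_common(1)[0][0]

print(predict_class("he"))  # -> 'subject' (2 of 3 expert labels)
```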
"Following this classification, the AI algorithms have not finished their work: they still need to link the three classes, subject, verb and complement, to give meaning to the sentence, and that is the job of the correlation algorithms. This phase is certainly more delicate and the algorithms more complex. Correlation involves rules tied to the expert's craft, mixing rules drawn from experience, exceptions, and so on. It is difficult to model these rules so that they remain valid at all times and unambiguous for both the machine and the human. In our example, the 'subject' class may be inverted: how do you teach the machine all the linguistic rules that lead to an 'inverted subject'? In security, it is the attackers who give us this data. Much as Google does, we look for classes of information in the data (traces) generated by attackers in order to understand them and make the algorithm learn."
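The sketch below illustrates what one such correlation rule can look like once traces have been classified. The event classes, the five-minute window and the brute-force rule are all invented for illustration, not taken from Intrinsec's rule base.

```python
from datetime import datetime, timedelta

# Already-classified traces, the output of the classification step.
events = [
    {"class": "auth_failure", "src": "203.0.113.7", "ts": datetime(2023, 10, 10, 13, 55, 0)},
    {"class": "auth_failure", "src": "203.0.113.7", "ts": datetime(2023, 10, 10, 13, 55, 20)},
    {"class": "auth_success", "src": "203.0.113.7", "ts": datetime(2023, 10, 10, 13, 55, 40)},
]

def correlate(events, window=timedelta(minutes=5)):
    """Expert rule: repeated failures followed by a success from the
    same source within the window suggests a successful brute force."""
    alerts = []
    for ok in (e for e in events if e["class"] == "auth_success"):
        fails = [e for e in events
                 if e["class"] == "auth_failure"
                 and e["src"] == ok["src"]
                 and timedelta(0) <= ok["ts"] - e["ts"] <= window]
        if len(fails) >= 2:
            alerts.append(("possible brute force", ok["src"]))
    return alerts

print(correlate(events))  # -> [('possible brute force', '203.0.113.7')]
```

Every threshold in such a rule (how many failures, how wide a window) encodes the expert's experience, which is exactly what makes these rules hard to model once and for all.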
From structured databases to Big Data
Véronique Legrand uses the example of Google's translation tool to explain this shift: "In the past, we used structured databases, where a query targeted a given field to obtain the corresponding result. With big data, this principle is no longer viable. Today, a field is not exactly what we think it is: each field belongs to a class or classification, and queries rely on ontologies (trees) to guide reasoning. The algorithm itself need not be complex or powerful; what counts is enriching it with examples. Google enriches its translation algorithm with data labelled by language experts. The richer it becomes, the more it converges towards increasingly refined results."
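A minimal sketch of the idea, with an invented mini-ontology: where a fixed-field query matches a literal value, an ontology-guided query walks the class tree, so a query on a parent class also matches its refinements.

```python
# Hypothetical mini-ontology: each node is a class, children refine it.
ontology = {
    "network_event": {
        "web_request": {"http_get": {}, "http_post": {}},
        "auth_event": {"login_failure": {}, "login_success": {}},
    }
}

def subclasses(tree: dict, target: str) -> set[str]:
    """Collect a class and all its descendants, so a query on
    'web_request' also matches traces labelled 'http_get'."""
    for cls, children in tree.items():
        if cls == target:
            out, stack = {cls}, [children]
            while stack:
                node = stack.pop()
                out.update(node)
                stack.extend(node.values())
            return out
        found = subclasses(children, target)
        if found:
            return found
    return set()

# A fixed-field query would match 'web_request' literally; the
# ontology-guided query also catches the refined classes.
print(subclasses(ontology, "web_request"))  # {'web_request', 'http_get', 'http_post'}
```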
Hackers also use artificial intelligence
In security, however, the problem is more complex. To return to the quantity and variety of logs that operational security teams have to deal with, it might be tempting to use simple translation techniques to automatically translate logs between their different 'languages' and make them converge. But hackers are capable of generating 'words' designed to confuse these automatic translation attempts and make everything appear normal. This forces researchers to vet the translation algorithms beforehand in order to validate the behaviour of the statistical engines.
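One plausible sanity check, sketched below with invented data and an arbitrary threshold: compare incoming tokens against a trusted baseline vocabulary and flag lines dominated by never-before-seen 'words', which adversarially generated input tends to produce.

```python
from collections import Counter

# Token frequencies from a trusted, human-validated corpus of traces.
baseline = Counter({"GET": 900, "POST": 80, "200": 700, "404": 120, "login": 60})

def unseen_ratio(line: str) -> float:
    """Fraction of tokens never observed in the trusted corpus.
    Adversarially generated 'words' push this ratio up."""
    tokens = line.split()
    unseen = sum(1 for t in tokens if baseline[t] == 0)
    return unseen / len(tokens) if tokens else 0.0

for line in ["GET 200 login", "zq8x GET kkfp0 vv-3t"]:
    # 0.5 is an arbitrary threshold chosen for this sketch.
    flag = "REVIEW" if unseen_ratio(line) > 0.5 else "ok"
    print(flag, line)
```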
Unlike a translation tool, where everyone has an interest in staying as close as possible to reality, security pits humans against humans on both sides of the network, each trying to fool the other. Worse still, as Véronique Legrand points out: "All machine learning tools are published and potentially known to attackers, who can detect how we have configured our tools. For example, if they send us a web request, the way in which we respond will be a clue for them. We also set up honeypots, but they may take the same approach. That is why continuous improvement is so important: it enables us to understand the overall context and to programme algorithms that are difficult to detect." She nevertheless qualifies this statement: "With the proliferation of vulnerabilities, humans alone can no longer do the job. Ideally, in terms of protection, we should be in the same situation as 15 years ago, with reasonable alert flows, but with today's correlation tools. For good protection, we first need to act on the sources of security systems so that they are better able to raise alerts that are relevant to humans. The second aspect is to better specify security systems at the design stage so that they are better equipped to participate in global self-defence, and the third is to have artificial intelligence that can draw on data from heterogeneous sources."
These are avenues being explored by a number of software vendors and SOCs, including Intrinsec. In this face-off between attacker and company, AI is a valuable aid, but contrary to what we sometimes read, the human element remains indispensable.
"The tool is intelligent, but only because humans make it evolve. Only humans can devise counter-attacks based on the attack," concludes Véronique Legrand.