Building Your Own Signal Detection Rules: A Data-Centric Approach to Cybersecurity (Part I)

Use Cases & Projects, Dataiku Product, Scaling AI | Kiel Guédec, Jeremy Scion, Karim Battata

Data is becoming more available, but it is also growing in size and becoming ever denser and more unstructured. In a context where far more than network logs is required to perform accurate anomaly detection and threat intelligence, the cybersecurity industry is aiming to introduce more automation into the process so that security experts get actionable information for identifying patterns, threats, attackers, and more.

Automation implies being able to design, deploy, and monitor a data flow. It also means leveraging machine learning wherever possible and orchestrating it according to what is now becoming the MLOps standard.

Storage is getting cheaper, allowing stakeholders to ingest a myriad of valuable data sources (e.g., fraud data, business unit data, social networks, and so on) and enrich their analyses with them. This makes it attractive for Chief Information Security Officers (CISOs) to customize or build their own rules management solutions, the way Chief Data Officers (CDOs) have done with data.

What You'll Read in This Blog Post:

  • Limitations With Legacy Solutions
  • Best Practices: A Data-Centric Design
  • Use Case: A Very Simple Threat Detection Example With Dataiku and Elasticsearch
    • Objective
    • Our Input Data
    • Creating a Set of Static Rules
    • Applying Your Rules to Your Data
    • Generating Your IOC Dataset
    • Exposing Our Compromised IPs to Our SOC Analysts
  • Benefits of a Data-Centric Solution for Threat Detection
  • Next Steps: Scaling Further With Machine Learning and DataOps for Rules Management

Limitations With Legacy Solutions

Let’s start with the data. It can live anywhere: spread across different data warehouses, already in use and being processed by another business unit, or simply out on the internet, requiring you to develop your own in-house connectors and data pipelines.

Next, the data consumers and producers come into play. They may not have direct access to sensitive logs yet still need valuable insights to enrich their data models. They also may not be trained on the legacy solution in use, or may be more inclined to deploy large open source frameworks such as Hadoop and Spark.

Finally, the non-technical audience (in this case, security experts) will need to provide feedback on false positives. They are the valuable experts who help organizations bootstrap their solutions quickly, and they are the source of truth as to whether a system is robust and accurate.

Ideally, this very diverse population would work together on the same project, collaborating on the same platform and focusing on their unique areas of expertise. While this is not impossible, it does come with its own set of best practices.

Best Practices: A Data-Centric Design

A data-centric architecture makes it possible for a diverse range of stakeholders to collaborate on the same project, as illustrated in the visual below.

collaboration on a data project

Dataiku DSS facilitates such deployment, as it provides built-in orchestration and data connection features that we will be using for our proof of concept.

In particular, the use of an orchestrator such as Dataiku DSS allows users to seamlessly source any kind of data (blob storage, SQL data warehouses, NoSQL databases, streaming, and so on), build machine learning pipelines, version with Git, and separate design from production.

We will now see how this approach can be beneficial with the example of a simple threat detection use case.

Use Case: A Very Simple Threat Detection Example With Dataiku and Elasticsearch

Objective

We have access to web proxy logs stored in Elasticsearch. The main goal is to detect a malware-infected machine via these web communications. To do this effectively, we will rely on the common web patterns that originate from those infected machines and apply detection rules to our data. We will eventually expose our results to analysts for deeper investigation.

Our very simple flow in Dataiku DSS will look like this:

simple flow in Dataiku DSS

Our Input Data

We source the data from Elasticsearch, creating a dataset in Dataiku DSS. This data corresponds to the logs of our web proxy. The logs contain harmless traffic but also connections to Command and Control (C&C) servers from machines infected by different malware.

creating a dataset in Dataiku DSS
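If you prefer working in code at this step, a Python recipe in Dataiku DSS can read the same Elasticsearch-backed dataset as a dataframe. Below is a minimal sketch; the dataset name "proxy_logs" and the column selection are assumptions for illustration, not the exact names used in our project.

```python
import dataiku

# Read the Elasticsearch-backed dataset from the flow
# (the dataset name "proxy_logs" is hypothetical).
proxy_logs = dataiku.Dataset("proxy_logs")
df = proxy_logs.get_dataframe()

# Quick sanity check on the ingested web proxy logs.
print(df.shape)
print(df[["timestamp", "src_ip", "dest", "http_user_agent"]].head())
```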

For the purpose of this example, the data was cleansed beforehand to be closer to CIM-compliant; it contains the following fields:

  • timestamp
  • src_ip
  • dest_ip
  • dest_port
  • http_method
  • dest: the server’s FQDN
  • uri_path
  • uri_query
  • url: concatenation of dest + uri_path + uri_query
  • http_user_agent
  • http_content_type: the MIME type
  • http_referrer
  • status: the HTTP status code
  • reason: the HTTP reason phrase

cleansed the dataset sample
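To make the schema concrete, here is a single, purely illustrative record following the fields above (the values are invented for the example and are not taken from the actual dataset):

```python
# One hypothetical proxy log record matching the schema described above.
sample_log = {
    "timestamp": "2021-03-01T10:15:42Z",
    "src_ip": "10.0.0.12",
    "dest_ip": "203.0.113.7",
    "dest_port": 80,
    "http_method": "POST",
    "dest": "c2-example.net",
    "uri_path": "/panel/gate.php",
    "uri_query": "",
    "url": "c2-example.net/panel/gate.php",
    "http_user_agent": "Mozilla/4.08 (Charon; Inferno)",
    "http_content_type": "application/octet-stream",
    "http_referrer": "",
    "status": 200,
    "reason": "OK",
}
```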

Creating a Set of Static Rules

It is as simple as writing the rules in a .csv file and uploading it or, even better, creating an editable dataset in Dataiku DSS and writing them there. This dataset will contain the different malware patterns that we want to cross-reference with the proxy logs.

dataset file in Dataiku DSS

dataset sample in Dataiku DSS

For example, the first line detects the Loki malware based on the user agent, the destination port, and the HTTP method. The description and rule columns enrich the dataset and make the detection more explicit.
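As a sketch, such a rules dataset could be built (or uploaded as a .csv) with columns for the matching fields plus the rule and description metadata. The column names and pattern values below are illustrative assumptions, not the exact content of our editable dataset:

```python
import pandas as pd

# Hypothetical rules dataset: each row is one malware web pattern to
# cross-reference with the proxy logs. Values are illustrative only.
rules = pd.DataFrame([
    {
        "rule": "loki_http_checkin",
        "description": "Loki C&C check-in over HTTP",
        "http_user_agent": "Mozilla/4.08 (Charon; Inferno)",
        "dest_port": 80,
        "http_method": "POST",
    },
    # ...additional malware patterns would follow here
])
```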

Applying Your Rules to Your Data

Matching rules to your data is as easy as joining the two datasets. Here, we will tell Dataiku DSS to join them based on a custom SQL condition, and we will then filter the result of the query to keep only the lines that show an actual match.

matching rules to data in Dataiku DSS

Everything happens in this single “join” recipe — the filter is added after the join is executed, as seen below:

Dataiku DSS join recipe

adding filter in Dataiku DSS
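For readers who prefer code, the pandas sketch below expresses logic equivalent to the join-plus-filter described above. The real flow does this in the visual recipe with a custom SQL condition; the dataset names and the user-agent matching logic here are assumptions carried over from the earlier illustrative sketches.

```python
import dataiku

# Read both datasets back from the flow (names are hypothetical).
df = dataiku.Dataset("proxy_logs").get_dataframe()
rules = dataiku.Dataset("detection_rules").get_dataframe()

# Join logs and rules on the fields the rules constrain exactly.
candidates = df.merge(
    rules, on=["dest_port", "http_method"], how="inner", suffixes=("", "_rule")
)

# Post-filter: keep only lines whose user agent actually contains the
# rule's pattern, i.e., a real match rather than a port/method overlap.
matches = candidates[
    candidates.apply(
        lambda r: r["http_user_agent_rule"] in r["http_user_agent"], axis=1
    )
]
```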

Generating Your IOC Dataset

Now that our rules match our data, we will simply aggregate the data and count! We will create a “group by” recipe that will count the number of matches for each tuple (src_ip, url, dest_ip, status) and add the description and rule values from the pattern .csv file.

aggregating data in Dataiku DSS

group by recipe in Dataiku DSS

Then, we will add the “concat distinct” option to remove duplicates.

removing duplicates in Dataiku DSS

After that, we apply a filter to keep only the entries that match more than 10 times (this is done in the post-filter section of the recipe). Note that more sophisticated techniques, including ML scoring, can also be used here.
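Continuing from the previous sketch, the pandas snippet below shows what the group-by, concat-distinct, and post-filter steps compute (the output dataset name and aggregate column names are illustrative assumptions):

```python
import dataiku

# Count matches per (src_ip, url, dest_ip, status) tuple and keep the
# rule metadata, collapsing duplicates ("concat distinct").
ioc = (
    matches
    .groupby(["src_ip", "url", "dest_ip", "status"])
    .agg(
        match_count=("rule", "size"),
        rules=("rule", lambda s: ",".join(sorted(set(s)))),
        descriptions=("description", lambda s: ",".join(sorted(set(s)))),
    )
    .reset_index()
)

# Post-filter equivalent: keep only tuples matched more than 10 times.
ioc = ioc[ioc["match_count"] > 10]

# Write the resulting IOC dataset back to the flow (name is hypothetical).
dataiku.Dataset("ioc_dataset").write_with_schema(ioc)
```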

Our IOC dataset now looks like this:

IOC dataset in Dataiku DSS

Exposing Our Compromised IPs to Our Security Operations Center (SOC) Analysts

Next, we would like to display the data nicely to a security operations analyst — let’s create a dashboard for her in a few clicks, showing the compromised IP sources.

dashboard in Dataiku DSS

Then, we’ll publish it and voilà, the dashboard is ready!

dashboard in Dataiku DSS

Benefits of a Data-Centric Solution for Threat Detection

This simple use case illustrates the main benefits of implementing threat detection on top of a data-centric architecture, including:

  • Some team members can focus on building pipelines while others can work on applying machine learning.
  • Using “regular” SQL for rules matching makes the solution much more portable to other storage and backend options.
  • There is no need to put all the data in a single storage location.
  • Adding a rule is as simple as writing it directly in your dataset: you get the threat and its type without having to rely on multiple vertical solutions.

Next Steps: Scaling Further With Machine Learning and DataOps for Rules Management

We have seen how a simple log ingestion can be processed to deliver quick insights that are valuable for a proof of concept (POC).

In a follow-up to this post, we will see how we can go beyond static rules to enhance our solution with machine learning scoring, and how we can iterate over our design to push it to production incrementally. In the meantime, feel free to send any comments, feedback, or suggestions to nicolas.pommier@dataiku.com.
