Detecting Unusual URL Sequences in Cloudflare Logs Using Markov Chains and Caching

By Asante Babers / Apr 25, 2025 / Detection Engineering / 5 Min Read

When I first started looking for a way to detect automated attacks, such as bots or scripted requests, in web traffic, I ran into a familiar problem: most of the available solutions were either too complex (machine learning models, heavy SIEM setups) or too rigid (static pattern matching, manual rule writing). I needed something practical, lightweight, and, importantly, Pythonic.

Then I had a breakthrough: what if I could detect abnormal sequences of URLs accessed by the same IP address? Bots and automated scripts tend to follow very predictable patterns, whereas legitimate user behavior is often more erratic. A simple technique from statistics, Markov Chains, lets me track these sequences and flag the ones that look suspicious.

🔍 What Are We Looking For?

In automated attacks, bots often follow a predictable order when accessing URLs. For example, a bot might visit /login, then /dashboard, then /profile, over and over again. This sequence of URLs is the bot’s “signature,” and if we catch it, we can raise an alert.

Human users tend to jump between URLs without a fixed order, so the same transition repeating many times from one IP in a short window is a sign of scripted behavior. This is where Markov Chains come in handy: by tracking the sequences of URLs each IP accesses, we can detect when an IP is behaving in an abnormal, scripted way.

⚙️ The Power of Markov Chains in Web Traffic

A Markov Chain is a statistical model of a sequence of events in which the probability of the next event depends only on the current state. In our case, the “state” is the URL being accessed, and the sequence is the order in which URLs are visited by an IP.

Instead of building a full Markov Chain with a transition probability for every pair of URLs, we can take a simpler approach: track pairs of consecutive URLs (current URL → next URL) and check whether any pair occurs too frequently within a short time window. This method is cheap enough for real-time traffic analysis, especially when combined with Panther’s caching system.
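
To make the "current URL → next URL" idea concrete, here is a minimal sketch that counts first-order transitions in one IP's request sequence and converts them into probabilities. The function names and the sample sequence are made up for illustration; this is not the detection itself, just the statistical idea behind it.

```python
from collections import Counter, defaultdict

def transition_counts(paths):
    """Count first-order transitions (current URL -> next URL) in one sequence."""
    return Counter(zip(paths, paths[1:]))

def transition_probabilities(counts):
    """Turn raw transition counts into P(next URL | current URL)."""
    totals = defaultdict(int)
    for (current, _next_url), n in counts.items():
        totals[current] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}

# Hypothetical sequence of requests from a single IP.
visits = ["/login", "/dashboard", "/profile",
          "/login", "/dashboard", "/profile", "/login"]
counts = transition_counts(visits)
probs = transition_probabilities(counts)
```

A scripted client produces a handful of transitions with high counts and probabilities close to 1, which is the pattern the rest of this post counts directly.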

🧪 The Game Plan

Here’s the strategy I used to detect unusual URL sequences:

  • Track sequences of URLs accessed by each IP.

  • Cache these sequences using Panther’s caching helpers.

  • Check for repeated patterns in the sequences.

  • Raise an alert if a sequence appears too frequently.

Let’s break it down into smaller steps.

🧑‍💻 Step-by-Step Code Breakdown

Step 1: Import Caching Helpers

The first thing we need is Panther’s caching helpers. These let us store and retrieve data (like previously seen URL sequences) between events, so each new log line can be compared against what the same IP did earlier instead of being evaluated in isolation.
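
If you want to experiment with the logic outside Panther, a small in-memory stand-in can play the same role as the cache. The class and method names below are hypothetical and are not Panther's API; inside a Panther rule you would use the platform's own caching helpers instead.

```python
import time

class InMemoryCache:
    """Hypothetical local stand-in for a key-value cache with counters and expiry."""

    def __init__(self):
        self._values = {}  # key -> stored value
        self._expiry = {}  # key -> epoch seconds at which the key is dropped

    def _purge(self, key, now=None):
        """Remove the key if its expiry time has passed."""
        now = time.time() if now is None else now
        if key in self._expiry and now >= self._expiry[key]:
            self._values.pop(key, None)
            self._expiry.pop(key, None)

    def get(self, key, default=None, now=None):
        self._purge(key, now)
        return self._values.get(key, default)

    def put(self, key, value, ttl_seconds=None, now=None):
        now = time.time() if now is None else now
        self._values[key] = value
        if ttl_seconds is not None:
            self._expiry[key] = now + ttl_seconds

    def increment(self, key, amount=1, now=None):
        """Add `amount` to a counter (starting from 0) and return the new value."""
        new_value = self.get(key, 0, now) + amount
        self._values[key] = new_value
        return new_value
```

The optional `now` argument exists only so the expiry behavior can be tested without waiting for real time to pass.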


Step 2: Track IP and URL Sequences

For each log event, we extract the IP address, URL being accessed, and timestamp.

def detect_unusual_sequence_patterns(event, time_window=15, threshold=5):
    ip = event['sourceIPAddress']
    current_url = event['resource']
    timestamp = event['timestamp']

Step 3: Build a Sequence Key and Cache First Access

We generate a unique cache key for each IP and URL combination and store the timestamp.
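
A minimal sketch of this step, using a plain dict in place of the real cache. The key format follows the `IP-URL` pattern used in the severity function later in this post; the helper names and the sample event are assumptions for illustration.

```python
def sequence_key(ip, url):
    """Build the cache key for one IP/URL combination."""
    return f"{ip}-{url}"

def record_first_access(cache, event):
    """Store the timestamp of the first time this IP/URL pair is seen.

    Returns True if this call created the entry, False if it already existed.
    """
    key = sequence_key(event["sourceIPAddress"], event["resource"])
    if key in cache:
        return False
    cache[key] = event["timestamp"]
    return True

cache = {}  # plain dict standing in for the real cache
event = {
    "sourceIPAddress": "203.0.113.7",
    "resource": "/login",
    "timestamp": "2025-04-25T12:00:00Z",
}
first = record_first_access(cache, event)
second = record_first_access(cache, event)
```

Only the first call writes to the cache, so the stored timestamp always marks the start of the observation window for that pair.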


Step 4: Time-Based Detection

We compare timestamps to make sure we're within a relevant time window before escalating.
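
One way to express the window check, as a sketch. It assumes timestamps are ISO 8601 strings and that `time_window` is measured in minutes; neither is confirmed by the function signature above, so adjust both to match your log format.

```python
from datetime import datetime, timedelta

def parse_timestamp(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def within_window(first_seen, current, time_window=15):
    """Return True if `current` is within `time_window` minutes after `first_seen`."""
    elapsed = parse_timestamp(current) - parse_timestamp(first_seen)
    return timedelta(0) <= elapsed <= timedelta(minutes=time_window)
```

Events that arrive outside the window should not count toward the threshold, which keeps slow, legitimate repetition from accumulating into an alert.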


Step 5: Pattern Detection Using a Counter

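A string set is the wrong structure for counting, so each sequence gets its own counter in the cache. Every time the sequence is seen again the counter is incremented, and the result is compared against the threshold.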
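
Here is a small sketch of the counting step, with `collections.Counter` standing in for the cached counters. The key format (IP, previous URL, current URL) and the function name are assumptions, and "too frequently" is interpreted here as strictly more than `threshold` occurrences.

```python
from collections import Counter

def count_sequence(counters, ip, previous_url, current_url, threshold=5):
    """Increment the counter for one URL transition and report whether
    it now exceeds the threshold."""
    key = f"{ip}-{previous_url}->{current_url}"  # hypothetical key format
    counters[key] += 1
    return counters[key] > threshold

counters = Counter()
results = [
    count_sequence(counters, "203.0.113.7", "/login", "/dashboard", threshold=3)
    for _ in range(4)
]
```

With `threshold=3`, the first three sightings are recorded quietly and the fourth is the first to be flagged.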


Step 6: Update the Cache

If the sequence isn’t over the threshold yet, we still record it.
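
Putting the steps together, the flow might look like the sketch below. It is not the full detection linked later: a plain dict stands in for the cache, timestamps are assumed to be epoch seconds, `time_window` is assumed to be in minutes, and `process_event` is a hypothetical name.

```python
def process_event(state, event, time_window=15, threshold=5):
    """Record one request and return True if its URL transition is over the threshold."""
    ip = event["sourceIPAddress"]
    url = event["resource"]
    now = event["timestamp"]  # epoch seconds (assumption)

    # Look up the previous URL for this IP, then remember the current one.
    previous_url = state.get(("last_url", ip))
    state[("last_url", ip)] = url
    if previous_url is None:
        return False  # first request from this IP: no transition yet

    key = ("count", ip, previous_url, url)
    window_start, count = state.get(key, (now, 0))
    if now - window_start > time_window * 60:
        window_start, count = now, 0  # window expired: start counting again

    count += 1
    state[key] = (window_start, count)  # always recorded, alert or not
    return count > threshold

# A scripted client cycling through three URLs, one request per second.
state = {}
urls = ["/login", "/dashboard", "/profile"]
results = [
    process_event(
        state,
        {"sourceIPAddress": "203.0.113.7", "resource": urls[i % 3], "timestamp": i},
        threshold=2,
    )
    for i in range(9)
]
```

The cache write happens before the threshold comparison, so a sequence that is still under the threshold is recorded and can trip the alert on a later event.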


🚨 Alerting and Context

Once we detect a suspicious pattern, we need to generate an alert. Here's how we do that:

def title(event):
    return f"Cloudflare: Detected Unusual URL Sequence from IP [{event.get('sourceIPAddress', '<NO_CLIENTIP>')}]"

def alert_context(event):
    return {
        "IP": event.get('sourceIPAddress', '<NO_CLIENTIP>'),
        "URL": event.get('resource', '<NO_RESOURCE>'),
        "Timestamp": event.get('timestamp', '<NO_TIMESTAMP>'),
        "EventDetails": event
    }

def severity(event):
    sequence_key = f"{event['sourceIPAddress']}-{event['resource']}"

These functions generate:

  • A clear title that identifies the suspicious IP and activity.

  • A structured context payload for investigation.

  • Severity logic based on how many times a pattern has repeated.
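
The severity logic itself is not shown in full here. One way to scale severity with the repetition count is sketched below. The tier boundaries (multiples of the threshold) and the helper name are my own assumptions, not the values from the original rule; the returned labels are the standard Panther severity names.

```python
def severity_for_count(count, threshold=5):
    """Map a repetition count to a severity label (hypothetical tiers)."""
    if count >= threshold * 4:
        return "CRITICAL"
    if count >= threshold * 2:
        return "HIGH"
    if count > threshold:
        return "MEDIUM"
    return "LOW"
```

In a rule, `severity(event)` would read the cached counter for the event's sequence key and pass it to a helper like this one.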

✅ Full Detection Code (Panther-Ready)

You can find the full detection code here.

🔍 Why This Works

This approach is simple yet effective:

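  • Caching: Panther’s caching system keeps per-sequence state between events, so each new request is checked against recent history without reprocessing older logs. This enables near real-time detection.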

  • Granularity: You can tune time_window and threshold to fit your environment’s normal behavior.

  • Efficiency: No need for fancy ML—just smart counters and a little statistical thinking.
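
For tuning, one option is to derive the threshold from a baseline of how often sequences repeat in normal traffic. The sketch below uses a nearest-rank percentile; the function name and the sample baseline are hypothetical, and the right percentile depends on how many alerts you can tolerate.

```python
def suggest_threshold(baseline_counts, percentile=99):
    """Return the nearest-rank percentile of per-sequence repeat counts."""
    if not baseline_counts:
        raise ValueError("baseline_counts must not be empty")
    ordered = sorted(baseline_counts)
    # Integer ceiling of percentile/100 * n, avoiding float rounding.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Hypothetical baseline: repeat counts per (IP, sequence) within one time window.
baseline = [1] * 90 + [2] * 8 + [3, 40]
threshold = suggest_threshold(baseline, percentile=99)
```

Setting the threshold at a high percentile of normal behavior means only sequences that repeat more often than nearly all baseline traffic will alert.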

🎵 Conclusion

By using Markov Chains and Panther’s caching features, we can detect automated bot behavior or scripted attacks in web traffic—without the complexity of traditional machine learning models. With just a bit of Python and some clever thinking, we can identify patterns that would otherwise go unnoticed.

Let the machines do the work. We'll be listening for those patterns.

Asante

©2023 Asante Babers
