Detecting Unusual URL Sequences in Cloudflare Logs Using Markov Chains and Caching
By Asante Babers / Apr 25, 2025 / Detection Engineering / 5 Min Read
When I first started looking for a way to detect automated activity in web traffic, such as bots and scripted attacks, I ran into a familiar problem: most of the available solutions were either too complex (machine learning models, heavy SIEM setups) or too rigid (static pattern matching, manual rule writing). I needed something practical, lightweight, and, importantly, Pythonic.
Then I had a breakthrough: what if I could detect abnormal sequences of URLs accessed by the same IP address? After all, bots and automated scripts tend to follow very predictable patterns, whereas legitimate user behavior is often more erratic. That’s when I realized: I could use a simple technique from statistics—Markov Chains—to track these sequences and flag the ones that looked suspicious.
🔍 What Are We Looking For?
In automated attacks, bots often follow a predictable order when accessing URLs. For example, a bot might visit /login, then /dashboard, then /profile, over and over again. This sequence of URLs is the bot’s “signature,” and if we catch it, we can raise an alert.
Unlike human users, who might jump from one URL to another without following a set order, bots tend to follow predictable paths. This is where Markov Chains come in handy. By tracking these sequences of accessed URLs, we can detect if an IP is behaving in an abnormal, scripted way.
⚙️ The Power of Markov Chains in Web Traffic
A Markov chain is a statistical model of a sequence of events in which the probability of the next event depends only on the current state. In our case, the “state” is the URL being accessed, and the sequence is the order in which URLs are visited by an IP.
Instead of implementing a full Markov Chain algorithm, we can use a simpler approach by just tracking sequences of two URLs (e.g., current URL → next URL) and checking if any sequence occurs too frequently within a short time window. This simple method is efficient enough for real-time traffic analysis, especially when combined with Panther’s caching system.
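To make the "current URL → next URL" idea concrete, here is a minimal sketch of counting transitions and turning them into first-order Markov transition probabilities. The function names and the sample visit list are illustrative assumptions, not part of the original detection.

```python
from collections import Counter

def transition_counts(urls):
    """Count how often each (current URL, next URL) pair occurs in order."""
    return Counter(zip(urls, urls[1:]))

def transition_probabilities(urls):
    """Estimate P(next URL | current URL) from the observed pair counts."""
    counts = transition_counts(urls)
    totals = Counter()
    for (current, _next_url), n in counts.items():
        totals[current] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}

# Hypothetical scripted client repeating the same three pages.
visits = ["/login", "/dashboard", "/profile"] * 3
counts = transition_counts(visits)
probs = transition_probabilities(visits)
```

For this scripted sequence every observed transition has probability 1.0, which is the kind of rigidity the detection looks for. The detection itself only needs the raw pair counts, not the probabilities.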
🧪 The Game Plan
Here’s the strategy I used to detect unusual URL sequences:
Track sequences of URLs accessed by each IP.
Cache these sequences using Panther’s caching helpers.
Check for repeated patterns in the sequences.
Raise an alert if a sequence appears too frequently.
Let’s break it down into smaller steps.
🧑‍💻 Step-by-Step Code Breakdown
Step 1: Import Caching Helpers
The first thing we need is Panther’s caching helpers. These let us store and retrieve data (like previously seen URL sequences) efficiently. This caching helps avoid having to process the same data repeatedly.
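Panther's own helper names and signatures should be taken from its documentation. For experimenting outside Panther, a small in-memory stand-in with counters and expirations is enough to follow the remaining steps. The class below is a hypothetical sketch and is not Panther's API.

```python
import time

class InMemoryCache:
    """Hypothetical stand-in for a key-value cache with counters and TTLs."""

    def __init__(self):
        self._values = {}
        self._expires = {}

    def _evict_if_expired(self, key, now):
        # Drop the key once its expiration time has passed.
        expires_at = self._expires.get(key)
        if expires_at is not None and now >= expires_at:
            self._values.pop(key, None)
            self._expires.pop(key, None)

    def get(self, key, default=None, now=None):
        now = time.time() if now is None else now
        self._evict_if_expired(key, now)
        return self._values.get(key, default)

    def increment(self, key, ttl_seconds=None, now=None):
        """Add 1 to a counter; the TTL is set only when the counter is created."""
        now = time.time() if now is None else now
        self._evict_if_expired(key, now)
        new_value = self._values.get(key, 0) + 1
        self._values[key] = new_value
        if ttl_seconds is not None and key not in self._expires:
            self._expires[key] = now + ttl_seconds
        return new_value

cache = InMemoryCache()
```

The optional `now` argument exists only so the behavior can be exercised with fixed timestamps instead of the wall clock.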
Step 2: Track IP and URL Sequences
For each log event, we extract the IP address, URL being accessed, and timestamp.
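A minimal sketch of this extraction, assuming the log record is a dict that uses the Cloudflare HTTP request field names ClientIP, ClientRequestURI, and EdgeStartTimestamp, with the timestamp in RFC 3339 form. Your log schema and timestamp format may differ, so treat the field names as assumptions.

```python
from datetime import datetime, timezone

def extract_fields(event):
    """Return (ip, url, unix_timestamp), or None if any field is missing."""
    ip = event.get("ClientIP")
    url = event.get("ClientRequestURI")
    raw_ts = event.get("EdgeStartTimestamp")
    if not ip or not url or not raw_ts:
        return None
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    parsed = datetime.fromisoformat(raw_ts.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return ip, url, parsed.timestamp()

sample = {
    "ClientIP": "203.0.113.7",
    "ClientRequestURI": "/login",
    "EdgeStartTimestamp": "2025-04-25T12:00:00Z",
}
fields = extract_fields(sample)
```

Returning None for incomplete records lets the caller skip them instead of raising inside the detection.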
Step 3: Build a Sequence Key and Cache First Access
We generate a unique cache key for each IP and URL combination and store the timestamp.
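One way to sketch this step, using a plain dict in place of the cache. The key layout (a `url_seq:` prefix, then the IP and the transition) is an assumption chosen for illustration.

```python
def sequence_key(ip, previous_url, current_url):
    """Build a cache key that identifies one URL transition for one IP."""
    return f"url_seq:{ip}:{previous_url}->{current_url}"

def record_first_access(cache, key, timestamp):
    """Store the timestamp only the first time a key is seen; return the stored value."""
    return cache.setdefault(key, timestamp)

cache = {}  # stand-in for a real cache
key = sequence_key("203.0.113.7", "/login", "/dashboard")
first_seen = record_first_access(cache, key, 1000.0)
still_first = record_first_access(cache, key, 1030.0)
```

Because `setdefault` never overwrites an existing entry, the second call still returns the original first-access time.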
Step 4: Time-Based Detection
We compare timestamps to make sure we're within a relevant time window before escalating.
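A rough sketch of the window check, again with a dict as the cache. The 300-second window is an assumed value, and restarting the window once it has elapsed is one possible design rather than the only one.

```python
TIME_WINDOW_SECONDS = 300  # assumed value; tune for your traffic

def in_active_window(cache, key, now, window=TIME_WINDOW_SECONDS):
    """Return True if `now` falls inside the window opened at first access.

    When there is no window yet, or the old one has elapsed, a new window
    starts at `now` and the function returns False.
    """
    first_seen = cache.get(key)
    if first_seen is None or now - first_seen > window:
        cache[key] = now
        return False
    return True

cache = {}
results = [
    in_active_window(cache, "k", 1000.0),  # first access opens the window
    in_active_window(cache, "k", 1200.0),  # 200s later: inside the window
    in_active_window(cache, "k", 1400.0),  # 400s later: window restarts
]
```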
Step 5: Pattern Detection Using a Counter
Instead of misusing a set as a dict, we keep a proper counter in the cache for each sequence.
Step 6: Update the Cache
If the sequence isn’t over the threshold yet, we still record it.
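Steps 5 and 6 can be sketched together: every transition is recorded, and a transition is flagged once its count reaches the threshold. This simplified version keeps state in local variables and leaves out the time window. The threshold of 3 and the sample traffic are assumptions for illustration.

```python
from collections import Counter

THRESHOLD = 3  # assumed value; tune for your traffic

def detect_repeats(events, threshold=THRESHOLD):
    """Return the (ip, previous_url, url) transitions seen at least `threshold` times."""
    last_url = {}       # most recent URL per IP
    counts = Counter()  # how often each transition has occurred
    flagged = set()
    for ip, url in events:
        previous = last_url.get(ip)
        last_url[ip] = url
        if previous is None:
            continue  # first request from this IP: no transition yet
        transition = (ip, previous, url)
        counts[transition] += 1  # always record, even below the threshold
        if counts[transition] >= threshold:
            flagged.add(transition)
    return flagged

bot = [("203.0.113.7", u) for u in ["/login", "/dashboard", "/profile"] * 3]
human = [("198.51.100.2", u) for u in ["/", "/pricing", "/blog", "/", "/docs"]]
flagged = detect_repeats(bot + human)
```

Only the scripted client's repeated transitions are flagged here. The human-like session never repeats a transition, so its counts stay at 1.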
🚨 Alerting and Context
Once we detect a suspicious pattern, we need to generate an alert. The alerting functions generate:
A clear title that identifies the suspicious IP and activity.
A structured context payload for investigation.
Severity logic based on how many times a pattern has repeated.
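As an illustration of those three outputs, here is a hedged sketch written as plain functions. The function names, context fields, and severity cut-offs (2x and 4x the threshold) are assumptions, not the original detection's values. In Panther these would typically be wired into the rule's title, alert context, and severity hooks.

```python
def build_title(ip, previous_url, url):
    """Human-readable alert title naming the IP and the repeated transition."""
    return f"Repeated URL sequence from {ip}: {previous_url} -> {url}"

def build_context(ip, previous_url, url, count, window_seconds):
    """Structured payload an analyst can use during investigation."""
    return {
        "ip": ip,
        "sequence": [previous_url, url],
        "count": count,
        "window_seconds": window_seconds,
    }

def build_severity(count, threshold):
    """Escalate as the pattern repeats more often (assumed cut-offs)."""
    if count >= threshold * 4:
        return "HIGH"
    if count >= threshold * 2:
        return "MEDIUM"
    return "LOW"

alert_title = build_title("203.0.113.7", "/login", "/dashboard")
context = build_context("203.0.113.7", "/login", "/dashboard", 12, 300)
```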
✅ Full Detection Code (Panther-Ready)
You can find the full detection code here.
🔍 Why This Works
This approach is simple yet effective:
Caching: Panther’s caching system avoids redundant computation, enabling near real-time detection.
Granularity: You can tune time_window and threshold to fit your environment’s normal behavior.
Efficiency: No need for fancy ML, just smart counters and a little statistical thinking.
🎵 Conclusion
By using Markov Chains and Panther’s caching features, we can detect automated bot behavior or scripted attacks in web traffic—without the complexity of traditional machine learning models. With just a bit of Python and some clever thinking, we can identify patterns that would otherwise go unnoticed.
Let the machines do the work. We'll be listening for those patterns.
— Asante