Add AI bot classification for event enrichment#253

Open
jaredmixpanel wants to merge 7 commits into master from feature/ai-bot-classification

jaredmixpanel commented Feb 19, 2026

Summary

Adds AI bot classification middleware/integration that automatically detects AI crawler requests (GPTBot, ClaudeBot, PerplexityBot, etc.) and enriches tracked events with classification properties.

What it does

  • Classifies user-agent strings against a database of 12 known AI bots
  • Enriches events with $is_ai_bot, $ai_bot_name, $ai_bot_provider, and $ai_bot_category properties
  • Supports custom bot patterns that take priority over built-in patterns
  • Case-insensitive matching

AI Bots Detected

GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-User, Google-Extended, PerplexityBot, Bytespider, CCBot, Applebot-Extended, Meta-ExternalAgent, cohere-ai

Implementation Details

Architecture

  • enable_bot_classification(mixpanel) wraps send_event_request on the client instance — the single chokepoint for track() and import() calls
  • Enrichment is triggered when the event properties contain a user-agent key (default $user_agent)
  • Returns { enable(), disable() } controller for runtime toggling
  • Includes a _ai_bot_classification_enabled guard on the mixpanel instance to prevent double-wrapping
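
The wrapping approach described above can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code: make_fake_client, classify, enable_bot_classification_sketch, and the _ai_bot_controller_sketch field are hypothetical names, the single-pattern classifier is a placeholder, and the (endpoint, event_name, properties, callback) argument order of send_event_request is an assumption.

```javascript
// Hypothetical stand-in for a Mixpanel client; only send_event_request matters here.
function make_fake_client() {
  const sent = [];
  return {
    sent,
    // Assumed argument order; the real method may differ.
    send_event_request(endpoint, event_name, properties, callback) {
      sent.push({ endpoint, event_name, properties });
      if (callback) callback(null);
    },
  };
}

// Placeholder classifier with a single trailing-slash pattern.
function classify(user_agent) {
  return /GPTBot\//i.test(user_agent || "")
    ? { $is_ai_bot: true, $ai_bot_name: "GPTBot" }
    : { $is_ai_bot: false };
}

function enable_bot_classification_sketch(client) {
  // Guard: a second call returns the existing controller instead of wrapping again.
  if (client._ai_bot_classification_enabled) {
    return client._ai_bot_controller_sketch;
  }
  let active = true;
  const original = client.send_event_request;
  client.send_event_request = function (endpoint, event_name, properties, callback) {
    let enriched = properties;
    if (active && properties && properties.$user_agent) {
      // Copy the properties so the caller's object is left untouched.
      enriched = Object.assign({}, properties, classify(properties.$user_agent));
    }
    return original.call(this, endpoint, event_name, enriched, callback);
  };
  const controller = {
    enable() { active = true; },
    disable() { active = false; },
  };
  client._ai_bot_classification_enabled = true;
  client._ai_bot_controller_sketch = controller; // hypothetical storage slot
  return controller;
}

const client = make_fake_client();
const controller = enable_bot_classification_sketch(client);
const second = enable_bot_classification_sketch(client); // no double wrap

client.send_event_request("/track", "Page View", { $user_agent: "GPTBot/1.2" });
controller.disable();
client.send_event_request("/track", "Page View", { $user_agent: "GPTBot/1.2" });
```

With the sketch above, the first recorded event carries the classification properties and the second, sent while disabled, passes through unchanged.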

Public API

ai_bot_classifier module:

| Export | Signature | Description |
| --- | --- | --- |
| classify_user_agent(userAgent) | (string \| null \| undefined) => AiBotClassification | Classify a single UA string against the built-in 12-bot database |
| create_classifier(options) | ({ additional_bots?: AiBotEntry[] }) => (userAgent: string) => AiBotClassification | Factory that returns a classify function with custom patterns prepended to the built-in list |
| get_bot_database() | () => AiBotEntry[] | Returns a copy of the built-in bot database (pattern, name, provider, category, description) |

ai_bot_middleware module:

| Export | Signature | Description |
| --- | --- | --- |
| enable_bot_classification(mixpanel, options?) | (mixpanel, BotClassificationOptions?) => BotClassificationController | Wraps send_event_request to auto-enrich events; returns { enable(), disable() } |
| track_request(mixpanel, req, eventName, properties?, callback?) | (mixpanel, IncomingMessage, string, object?, Function?) => void | Helper that extracts user-agent and IP from an HTTP request and calls mixpanel.track() |
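
As a rough sketch of what a helper like track_request might do, the snippet below copies the caller's properties, adds the user-agent header and client IP, and forwards to track(). The names track_request_sketch, fake_client, and fake_req are hypothetical, and reading the IP from req.socket.remoteAddress into an ip property is an assumption; the PR may derive it differently (for example from proxy headers).

```javascript
// Hypothetical helper; not the PR's implementation.
function track_request_sketch(client, req, event_name, properties, callback) {
  const headers = req.headers || {};
  // Work on a copy so the caller's properties object is never mutated.
  const enriched = Object.assign({}, properties);
  // Node lowercases header names on IncomingMessage.headers.
  if (headers["user-agent"]) {
    enriched.$user_agent = headers["user-agent"];
  }
  // Assumption: the socket address is an acceptable source for the IP.
  const ip = req.socket && req.socket.remoteAddress;
  if (ip) {
    enriched.ip = ip;
  }
  client.track(event_name, enriched, callback);
}

// Minimal fakes so the sketch runs without an HTTP server or the SDK.
const tracked = [];
const fake_client = {
  track(event_name, properties) {
    tracked.push({ event_name, properties });
  },
};
const fake_req = {
  headers: { "user-agent": "ClaudeBot/1.0" },
  socket: { remoteAddress: "203.0.113.7" },
};

const input = { distinct_id: "anonymous" };
track_request_sketch(fake_client, fake_req, "API Content Served", input);
```

The input object keeps only its original distinct_id key, while the tracked event carries the added $user_agent and ip values.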

BotClassificationOptions:

  • user_agent_property (default "$user_agent") — property key to read the UA from
  • property_prefix (default "$") — prefix for injected classification properties
  • additional_bots — array of { pattern: RegExp, name, provider, category } checked before built-in bots
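
To make the three options concrete, here is a small sketch of how they could shape the enriched output. It assumes the defaults listed above; DEFAULTS_SKETCH, build_properties, and enrich_sketch are hypothetical names, and the single built-in entry is a placeholder for the real database.

```javascript
// Assumed defaults, mirroring the documented option defaults.
const DEFAULTS_SKETCH = {
  user_agent_property: "$user_agent",
  property_prefix: "$",
  additional_bots: [],
};

// Hypothetical helper: turns a matched bot entry into prefixed properties.
function build_properties(match, prefix) {
  if (!match) {
    return { [`${prefix}is_ai_bot`]: false };
  }
  return {
    [`${prefix}is_ai_bot`]: true,
    [`${prefix}ai_bot_name`]: match.name,
    [`${prefix}ai_bot_provider`]: match.provider,
    [`${prefix}ai_bot_category`]: match.category,
  };
}

function enrich_sketch(properties, options) {
  const opts = Object.assign({}, DEFAULTS_SKETCH, options);
  const user_agent = properties[opts.user_agent_property];
  // No UA under the configured key: leave the event as it is.
  if (typeof user_agent !== "string" || user_agent === "") {
    return properties;
  }
  // Custom entries come first; one placeholder stands in for the built-ins.
  const bots = [
    ...opts.additional_bots,
    { pattern: /ClaudeBot\//i, name: "ClaudeBot", provider: "Anthropic", category: "indexing" },
  ];
  const match = bots.find((bot) => bot.pattern.test(user_agent));
  return Object.assign({}, properties, build_properties(match, opts.property_prefix));
}

const with_defaults = enrich_sketch({ $user_agent: "ClaudeBot/1.0" });
const with_custom_keys = enrich_sketch(
  { ua: "ClaudeBot/1.0" },
  { user_agent_property: "ua", property_prefix: "x_" }
);
const without_ua = enrich_sketch({ path: "/home" });
```

In this sketch, changing property_prefix renames every injected key, and an event without the configured UA key is returned unchanged.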

Notable Design Decisions

  • Wraps send_event_request, not track(): This catches both track() and import() calls through a single interception point, avoiding the need to wrap multiple methods
  • Custom bots prepended, not appended: create_classifier spreads additional_bots before AI_BOT_DATABASE so custom patterns take priority over built-in ones
  • Regex patterns require trailing slash: All built-in patterns match BotName/ (e.g. /GPTBot\//i) to avoid false positives on substrings — the trailing slash is part of the standard bot UA version token
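
The prepend ordering and the trailing-slash convention can be illustrated with a short sketch. BUILT_IN_SKETCH and create_classifier_sketch are hypothetical stand-ins for AI_BOT_DATABASE and create_classifier, with only one built-in entry included for brevity.

```javascript
// One-entry placeholder for the built-in database.
const BUILT_IN_SKETCH = [
  { pattern: /GPTBot\//i, name: "GPTBot", provider: "OpenAI", category: "indexing" },
];

function create_classifier_sketch(options = {}) {
  // Custom entries are spread first, so they win when both lists match.
  const bots = [...(options.additional_bots || []), ...BUILT_IN_SKETCH];
  return function classify(user_agent) {
    if (typeof user_agent !== "string") {
      return { $is_ai_bot: false };
    }
    // Array.prototype.find returns the first match in list order.
    const match = bots.find((bot) => bot.pattern.test(user_agent));
    if (!match) {
      return { $is_ai_bot: false };
    }
    return {
      $is_ai_bot: true,
      $ai_bot_name: match.name,
      $ai_bot_provider: match.provider,
      $ai_bot_category: match.category,
    };
  };
}

const default_classify = create_classifier_sketch();
const custom_classify = create_classifier_sketch({
  additional_bots: [
    { pattern: /GPTBot\//i, name: "GPTBot (override)", provider: "Acme Corp", category: "agent" },
  ],
});

const lower_case_hit = default_classify("gptbot/1.2");          // case-insensitive match
const substring_miss = default_classify("Mozilla/5.0 GPTBotany"); // no slash after "GPTBot"
const overridden = custom_classify("GPTBot/1.2");                 // custom entry matches first
```

Because the custom entry sits ahead of the built-in one, the same UA string resolves to the override, while a UA that merely contains "GPTBot" without the version slash is not classified as a bot.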

Usage Examples

Automatic Event Enrichment

const Mixpanel = require("mixpanel");
const { enable_bot_classification } = require("mixpanel/lib/ai_bot_middleware");

const mixpanel = Mixpanel.init("YOUR_TOKEN");
const controller = enable_bot_classification(mixpanel);

// All track() and import() calls now auto-enrich when $user_agent is present
mixpanel.track("Page View", {
  $user_agent: "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)",
  distinct_id: "user-123",
});
// => event properties include: { $is_ai_bot: true, $ai_bot_name: "GPTBot", $ai_bot_provider: "OpenAI", $ai_bot_category: "indexing", ... }

// Runtime toggle
controller.disable(); // pause enrichment
controller.enable();  // resume enrichment

Standalone Classification

const { classify_user_agent } = require("mixpanel/lib/ai_bot_classifier");

const result = classify_user_agent("ClaudeBot/1.0");
// => { $is_ai_bot: true, $ai_bot_name: "ClaudeBot", $ai_bot_provider: "Anthropic", $ai_bot_category: "indexing" }

const notBot = classify_user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/120.0");
// => { $is_ai_bot: false }

const empty = classify_user_agent(null);
// => { $is_ai_bot: false }

Custom Bot Patterns

const { create_classifier } = require("mixpanel/lib/ai_bot_classifier");

const classify = create_classifier({
  additional_bots: [
    {
      pattern: /MyInternalBot\//i,
      name: "MyInternalBot",
      provider: "Acme Corp",
      category: "agent",
    },
  ],
});

classify("MyInternalBot/2.0 (internal crawler)");
// => { $is_ai_bot: true, $ai_bot_name: "MyInternalBot", $ai_bot_provider: "Acme Corp", $ai_bot_category: "agent" }

Framework Integration (Express)

const express = require("express");
const Mixpanel = require("mixpanel");
const { enable_bot_classification, track_request } = require("mixpanel/lib/ai_bot_middleware");

const app = express();
const mixpanel = Mixpanel.init("YOUR_TOKEN");
enable_bot_classification(mixpanel);

app.get("/api/content", (req, res) => {
  // Extracts user-agent header and IP from req, then calls mixpanel.track()
  // Bot classification happens automatically via the send_event_request wrapper
  track_request(mixpanel, req, "API Content Served", {
    distinct_id: req.query.user_id || "anonymous",
    path: req.path,
  });
  res.json({ content: "..." });
});

Files Added

  • lib/ai_bot_classifier.d.ts
  • lib/ai_bot_classifier.js
  • lib/ai_bot_middleware.d.ts
  • lib/ai_bot_middleware.js
  • test/ai_bot_classifier.js
  • test/ai_bot_middleware.js

Files Modified

  • lib/mixpanel-node.d.ts
  • lib/mixpanel-node.js

Test Plan

  • All 12 AI bot user-agents correctly classified
  • Non-AI-bot user-agents return $is_ai_bot: false (Chrome, Googlebot, curl, etc.)
  • Empty string and null/undefined inputs handled gracefully
  • Case-insensitive matching works
  • Custom bot patterns checked before built-in
  • Event properties preserved through enrichment
  • No regressions in existing test suite

Part of AI bot classification feature for Node.js SDK.
Fix CI formatting check for AI bot classification files.
Fix CI lint check in AI bot middleware.

Copilot AI left a comment

Pull request overview

This PR adds AI bot classification middleware to automatically detect and enrich events from AI crawler requests (GPTBot, ClaudeBot, PerplexityBot, etc.). It implements a pattern-based user-agent classifier that enriches Mixpanel events with bot detection properties when the $user_agent field is present.

Changes:

  • Adds a classifier module that matches user-agent strings against 12 known AI bot patterns
  • Implements middleware that wraps send_event_request to enrich events with classification properties
  • Exports the new functionality via Mixpanel.ai and Mixpanel.AiBotClassifier

Reviewed changes

Copilot reviewed 5 out of 8 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| lib/ai_bot_classifier.js | Core classification logic with AI bot pattern database and matching functions |
| lib/ai_bot_classifier.d.ts | TypeScript type definitions for classifier functions and interfaces |
| lib/ai_bot_middleware.js | Middleware implementation that wraps send_event_request and helper for HTTP request tracking |
| lib/ai_bot_middleware.d.ts | TypeScript type definitions for middleware functions and options |
| lib/mixpanel-node.js | Exports the new ai middleware and AiBotClassifier modules |
| lib/mixpanel-node.d.ts | TypeScript module augmentation attempting to add new exports |
| test/ai_bot_classifier.js | Comprehensive tests for classifier covering all 12 bot patterns and edge cases |
| test/ai_bot_middleware.js | Integration tests for middleware including configuration options and limitations |

- Add missing $ai_bot_category assertions for 4 bot tests
- Prevent mutation of input properties in track_request
- Add double-wrapping guard for enable_bot_classification
- Fix JSDoc comment accuracy