Stringy applies semantic analysis to extracted strings, identifying patterns that indicate specific types of data. This helps analysts focus on the most relevant information quickly.
Raw String -> Pattern Matching -> Validation -> Tag Assignment
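
The four stages can be sketched as one small function. This is a minimal illustration under stated assumptions, not Stringy's implementation: the `Tag` enum, the `Rule` struct, and the two sample rules are hypothetical, and plain string checks stand in for the regex matching described later.

```rust
/// Hypothetical tag set for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tag {
    Url,
    Guid,
}

/// One rule: a cheap pattern check, a stricter validation, and the tag to assign.
pub struct Rule {
    pub tag: Tag,
    pub matches: fn(&str) -> bool,
    pub validate: fn(&str) -> bool,
}

fn url_pattern(s: &str) -> bool {
    s.contains("://")
}

fn url_valid(s: &str) -> bool {
    s.starts_with("http://") || s.starts_with("https://")
}

fn guid_pattern(s: &str) -> bool {
    s.starts_with('{') && s.ends_with('}')
}

fn guid_valid(s: &str) -> bool {
    // Expect 8-4-4-4-12 hex digits between one pair of braces.
    let inner = match s.strip_prefix('{').and_then(|rest| rest.strip_suffix('}')) {
        Some(inner) => inner,
        None => return false,
    };
    let expected = [8usize, 4, 4, 4, 12];
    let groups: Vec<&str> = inner.split('-').collect();
    groups.len() == expected.len()
        && groups
            .iter()
            .zip(expected.iter())
            .all(|(g, &n)| g.len() == n && g.bytes().all(|b| b.is_ascii_hexdigit()))
}

/// Illustrative rule table; a real classifier would have one rule per tag.
pub fn default_rules() -> Vec<Rule> {
    vec![
        Rule { tag: Tag::Url, matches: url_pattern, validate: url_valid },
        Rule { tag: Tag::Guid, matches: guid_pattern, validate: guid_valid },
    ]
}

/// Raw String -> Pattern Matching -> Validation -> Tag Assignment.
pub fn classify(raw: &str, rules: &[Rule]) -> Vec<Tag> {
    rules
        .iter()
        .filter(|rule| (rule.matches)(raw)) // Pattern Matching
        .filter(|rule| (rule.validate)(raw)) // Validation
        .map(|rule| rule.tag) // Tag Assignment
        .collect()
}
```

A string such as `ftp://example.com` passes the pattern stage here but is dropped during validation, so it receives no tag.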

URLs:
- Pattern: `` https?://[^\s<>"{}|\\\^\[\]\`]+ ``
- Examples: `https://example.com/path`, `http://malware.site/payload`
- Validation: Must start with `http://` or `https://`

Domain names:
- Pattern: RFC 1035 compliant domain format
- Examples: `example.com`, `subdomain.evil.site`
- Validation: Valid TLD from known list, not a URL or email
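
The domain checks above might look roughly like the following. This is a sketch only: `KNOWN_TLDS` is a tiny hypothetical stand-in for the real TLD list, `looks_like_domain` is an illustrative name, and the label rules follow RFC 1035 loosely (letters, digits, and interior hyphens, at most 63 bytes per label).

```rust
/// Hypothetical subset of a known-TLD list, for illustration only.
const KNOWN_TLDS: &[&str] = &["com", "net", "org", "site", "uk"];

pub fn looks_like_domain(s: &str) -> bool {
    // Leave URLs and email addresses to their own classifiers.
    if s.contains("://") || s.contains('@') {
        return false;
    }
    if s.len() > 253 {
        return false;
    }
    let labels: Vec<&str> = s.split('.').collect();
    if labels.len() < 2 {
        return false;
    }
    let label_ok = |label: &str| {
        !label.is_empty()
            && label.len() <= 63
            && !label.starts_with('-')
            && !label.ends_with('-')
            && label.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'-')
    };
    // The final label must be a known TLD (compared case-insensitively).
    let tld = labels[labels.len() - 1].to_ascii_lowercase();
    labels.iter().all(|&label| label_ok(label)) && KNOWN_TLDS.contains(&tld.as_str())
}
```

With this subset, `192.168.1.1` is rejected because `1` is not a known TLD, which keeps dotted IPv4 addresses out of the domain tag.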

IP addresses:
- IPv4 Pattern: Standard dotted-decimal notation
- IPv6 Pattern: Full and compressed formats
- Examples: `192.168.1.1`, `::1`, `2001:db8::1`
- Validation: Valid octet ranges for IPv4, proper format for IPv6
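
Both validation rules can be delegated to the standard library's address parser, which rejects out-of-range octets and malformed IPv6 text. The sketch below shows that approach; it is not necessarily how Stringy validates addresses, and `IpKind` and `classify_ip` are hypothetical names.

```rust
use std::net::IpAddr;

/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum IpKind {
    V4,
    V6,
}

/// Returns the address family if the whole string parses as an IP address.
pub fn classify_ip(s: &str) -> Option<IpKind> {
    match s.parse::<IpAddr>() {
        Ok(IpAddr::V4(_)) => Some(IpKind::V4),
        Ok(IpAddr::V6(_)) => Some(IpKind::V6),
        Err(_) => None,
    }
}
```

For example, `classify_ip("2001:db8::1")` yields `Some(IpKind::V6)`, while `classify_ip("256.1.1.1")` yields `None` because 256 is not a valid octet.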

File paths:
- POSIX Pattern: Paths starting with `/`
- Windows Pattern: Drive letters (`C:\`) or relative paths
- UNC Pattern: `\\server\share` format
- Examples: `/etc/passwd`, `C:\Windows\System32`, `\\server\share\file`
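
The three absolute path shapes can be told apart by their prefixes. The following is a minimal sketch with hypothetical names (`PathKind`, `classify_path`); it deliberately ignores relative Windows paths, which need more context to distinguish from ordinary text.

```rust
/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum PathKind {
    Posix,
    WindowsDrive,
    Unc,
}

pub fn classify_path(s: &str) -> Option<PathKind> {
    if s.starts_with("\\\\") {
        // UNC: \\server\share[\...] needs a non-empty server and share name.
        let mut parts = s[2..].split('\\');
        let server = parts.next().unwrap_or("");
        let share = parts.next().unwrap_or("");
        return if !server.is_empty() && !share.is_empty() {
            Some(PathKind::Unc)
        } else {
            None
        };
    }
    let bytes = s.as_bytes();
    // Drive letter form: one ASCII letter, a colon, then a backslash.
    if bytes.len() >= 3 && bytes[0].is_ascii_alphabetic() && bytes[1] == b':' && bytes[2] == b'\\' {
        return Some(PathKind::WindowsDrive);
    }
    // POSIX absolute path: leading slash plus at least one more byte.
    if s.starts_with('/') && s.len() > 1 {
        return Some(PathKind::Posix);
    }
    None
}
```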

Registry paths:
- Pattern: `HKEY_*` or `HK*\` prefixes
- Examples: `HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft`
- Validation: Must start with valid registry root key
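
A root-key check of this kind could be written as below. The listed names are standard Windows registry root keys and their common abbreviations, but the exact set Stringy accepts is an assumption, as is the function name `is_registry_path`.

```rust
/// Standard Windows registry root keys (assumed set for this sketch).
const ROOT_KEYS: &[&str] = &[
    "HKEY_LOCAL_MACHINE",
    "HKEY_CURRENT_USER",
    "HKEY_CLASSES_ROOT",
    "HKEY_USERS",
    "HKEY_CURRENT_CONFIG",
];

/// Common abbreviations; these only count when followed by a backslash.
const ROOT_ABBREVIATIONS: &[&str] = &["HKLM", "HKCU", "HKCR", "HKU", "HKCC"];

pub fn is_registry_path(s: &str) -> bool {
    // Everything before the first backslash is the candidate root key.
    let root = s.split('\\').next().unwrap_or("");
    let has_subpath = s.len() > root.len();
    ROOT_KEYS.contains(&root) || (ROOT_ABBREVIATIONS.contains(&root) && has_subpath)
}
```

Requiring a trailing backslash for abbreviations mirrors the `HK*\` pattern and avoids tagging a bare `HKLM` that appears in unrelated text.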

GUIDs:
- Pattern: `\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}`
- Examples: `{12345678-1234-1234-1234-123456789abc}`
- Validation: Strict format compliance with braces required

Email addresses:
- Pattern: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- Examples: `admin@malware.com`, `user.name+tag@example.co.uk`
- Validation: Single `@`, valid TLD length and characters, no empty parts
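
The email rules can be expressed without a regex, as in the sketch below. It assumes the 2 to 24 character alphabetic TLD limit stated later in this document and reuses the character sets from the pattern; `is_valid_email` is a hypothetical name, not Stringy's API.

```rust
pub fn is_valid_email(s: &str) -> bool {
    // Exactly one '@' separating a local part from a domain.
    let mut parts = s.split('@');
    let (local, domain) = match (parts.next(), parts.next(), parts.next()) {
        (Some(local), Some(domain), None) => (local, domain),
        _ => return false,
    };
    if local.is_empty() {
        return false;
    }
    // The domain needs at least one dot and no empty labels.
    let labels: Vec<&str> = domain.split('.').collect();
    if labels.len() < 2 || labels.iter().any(|label| label.is_empty()) {
        return false;
    }
    let local_ok = local
        .bytes()
        .all(|b| b.is_ascii_alphanumeric() || b"._%+-".contains(&b));
    let domain_ok = domain
        .bytes()
        .all(|b| b.is_ascii_alphanumeric() || b == b'.' || b == b'-');
    // TLD: alphabetic, 2 to 24 characters.
    let tld = labels[labels.len() - 1];
    let tld_ok = (2..=24).contains(&tld.len()) && tld.bytes().all(|b| b.is_ascii_alphabetic());
    local_ok && domain_ok && tld_ok
}
```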

Base64 data:
- Pattern: `[A-Za-z0-9+/]{20,}={0,2}`
- Examples: `U29tZSBsb25nZXIgYmFzZTY0IHN0cmluZw==`
- Validation: Length >= 20, length divisible by 4, padding rules, entropy threshold
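
The Base64 checks combine structural rules with an entropy filter that rejects repetitive strings such as long runs of `A`. The sketch below uses Shannon entropy over bytes; the threshold `MIN_ENTROPY_BITS` is a hypothetical value chosen for illustration, since the actual threshold is not stated here, and the function names are likewise assumptions.

```rust
/// Shannon entropy in bits per byte.
pub fn shannon_entropy(s: &str) -> f64 {
    let mut counts = [0usize; 256];
    for b in s.bytes() {
        counts[b as usize] += 1;
    }
    let len = s.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// Hypothetical threshold; tune against real data.
const MIN_ENTROPY_BITS: f64 = 3.0;

pub fn looks_like_base64(s: &str) -> bool {
    // Length >= 20 and divisible by 4.
    if s.len() < 20 || s.len() % 4 != 0 {
        return false;
    }
    // At most two '=' characters, and only at the end.
    let body = s.trim_end_matches('=');
    if s.len() - body.len() > 2 {
        return false;
    }
    // Only the standard alphabet in the body (this also rejects interior '=').
    if !body
        .bytes()
        .all(|b| b.is_ascii_alphanumeric() || b == b'+' || b == b'/')
    {
        return false;
    }
    shannon_entropy(body) >= MIN_ENTROPY_BITS
}
```

The entropy step is what separates encoded data from padding-like filler: a string of 24 identical characters satisfies every structural rule but has zero entropy.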

Format strings:
- Pattern: `%[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}`
- Examples: `Error: %s at line %d`, `User {0} logged in`
- Validation: Reasonable specifier count, context-aware thresholds
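
Counting specifiers is the core of the format-string validation. The hand-written scanner below recognizes the same three forms as the pattern (`%s`, `%10d`, `{0}`); it is an illustrative sketch, and the `max_specifiers` parameter is a hypothetical stand-in for the context-aware thresholds, whose actual values are not given here.

```rust
/// Conversion characters accepted after '%', taken from the pattern above.
const CONVERSIONS: &[u8] = b"sdxofcpn";

/// Counts printf-style (`%s`, `%10d`) and positional (`{0}`) specifiers.
pub fn count_specifiers(s: &str) -> usize {
    let bytes = s.as_bytes();
    let mut count = 0;
    let mut i = 0;
    while i < bytes.len() {
        let open = bytes[i];
        if open == b'%' || open == b'{' {
            // Skip the run of digits after the opening character.
            let mut j = i + 1;
            while j < bytes.len() && bytes[j].is_ascii_digit() {
                j += 1;
            }
            let has_digits = j > i + 1;
            let closed = match bytes.get(j) {
                // '%' takes optional digits, then a conversion character.
                Some(&c) if open == b'%' => CONVERSIONS.contains(&c),
                // '{' requires at least one digit, then '}'.
                Some(&c) => has_digits && c == b'}',
                None => false,
            };
            if closed {
                count += 1;
                i = j + 1;
                continue;
            }
        }
        i += 1;
    }
    count
}

/// At least one specifier, and no more than the caller's (hypothetical) limit.
pub fn looks_like_format_string(s: &str, max_specifiers: usize) -> bool {
    let n = count_specifiers(s);
    n >= 1 && n <= max_specifiers
}
```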

User agents:
- Pattern: `Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+`
- Examples: `Mozilla/5.0 (Windows NT 10.0; Win64; x64)`, `Chrome/117.0.5938.92`
- Validation: Known browser identifiers and minimum length
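
A token-based check equivalent to the pattern might look like this. The token list comes from the pattern above; `MIN_USER_AGENT_LEN` is a hypothetical value, since the actual minimum length is not stated here, and `looks_like_user_agent` is an illustrative name.

```rust
/// Browser tokens taken from the pattern above.
const BROWSER_TOKENS: &[&str] = &["Mozilla/", "Chrome/", "Safari/", "AppleWebKit/"];

/// Hypothetical minimum length for this sketch.
const MIN_USER_AGENT_LEN: usize = 10;

pub fn looks_like_user_agent(s: &str) -> bool {
    // Cheap length check first, before any scanning.
    if s.len() < MIN_USER_AGENT_LEN {
        return false;
    }
    // A token only counts when a version character (digit or dot) follows it.
    BROWSER_TOKENS.iter().any(|&token| {
        s.match_indices(token).any(|(idx, _)| {
            s[idx + token.len()..].starts_with(|c: char| c.is_ascii_digit() || c == '.')
        })
    })
}
```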
The semantic classifier uses cached regex patterns via once_cell::sync::Lazy and applies validation checks to reduce false positives.

    use once_cell::sync::Lazy;
    use regex::Regex;

    // Compiled on first use, then shared by every later call.
    static GUID_REGEX: Lazy<Regex> = Lazy::new(|| {
        Regex::new(r"^\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}$")
            .expect("Invalid GUID regex")
    });

Example usage:

    use stringy::classification::SemanticClassifier;
    use stringy::types::{BinaryFormat, Encoding, SectionType, StringContext, StringSource, Tag};

    let classifier = SemanticClassifier::new();

    // Describe where the string was found in the binary.
    let context = StringContext::new(
        SectionType::StringData,
        BinaryFormat::Elf,
        Encoding::Ascii,
        StringSource::SectionData,
    )
    .with_section_name(".rodata".to_string());

    // Classify one string and act on the resulting tags.
    let tags = classifier.classify("{12345678-1234-1234-1234-123456789abc}", &context);
    if tags.contains(&Tag::Guid) {
        // Handle GUID indicator
    }

- GUID: Braced, hyphenated, hex-only format.
- Email: TLD length must be between 2 and 24 and alphabetic; domain must include a dot.
- Base64: Length must be divisible by 4, padding allowed only at the end, entropy threshold applied.
- Format String: Must contain at least one specifier and pass context-aware length checks.
- User Agent: Must contain a known browser token and meet minimum length.
- Regexes are compiled once via `once_cell::sync::Lazy` and reused across calls.
- Minimum length checks avoid unnecessary regex work on short inputs.
- The classifier is stateless and thread-safe.
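
The combination of one-time initialization and a stateless classifier is what makes concurrent use safe. The sketch below illustrates the idea with the standard library's `OnceLock` standing in for `once_cell::sync::Lazy`, and a simple token table standing in for compiled regexes. `Classifier`, `has_browser_token`, `classify_in_parallel`, and `INIT_COUNT` are hypothetical names introduced only for this example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;
use std::thread;

/// Counts how often the shared table is built, to show it happens only once.
pub static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Lazily built, immutable data shared by all threads.
fn browser_tokens() -> &'static [&'static str] {
    static TOKENS: OnceLock<Vec<&'static str>> = OnceLock::new();
    TOKENS.get_or_init(|| {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        vec!["Mozilla/", "Chrome/", "Safari/", "AppleWebKit/"]
    })
}

/// No fields, so there is no per-instance state to synchronize.
#[derive(Clone, Copy, Default)]
pub struct Classifier;

impl Classifier {
    pub fn has_browser_token(&self, s: &str) -> bool {
        browser_tokens().iter().any(|&token| s.contains(token))
    }
}

/// Classifies each input on its own thread, sharing one classifier by reference.
pub fn classify_in_parallel(inputs: &[&str]) -> Vec<bool> {
    let classifier = Classifier;
    thread::scope(|scope| {
        let handles: Vec<_> = inputs
            .iter()
            .map(|&input| {
                let classifier = &classifier;
                scope.spawn(move || classifier.has_browser_token(input))
            })
            .collect();
        // Join in spawn order so results line up with the inputs.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Because the shared table is only read after initialization and the classifier holds no fields, no locking is needed on the classification path.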

- Unit tests: `tests/classification_tests.rs`
- Integration tests: `tests/classification_integration_tests.rs`

Run tests with:

    just test