Architecture & Internals

sekretbarilo is a high-performance secret scanner written in Rust. This page explains how it works internally, from the ground up.

Project Structure

src/
  main.rs           - cli entry point, hand-rolled command parsing
  lib.rs            - library exports
  agent/mod.rs      - agent hook support (claude code integration)
  audit/
    mod.rs          - working tree scanning
    history.rs      - git history scanning with branch resolution
  config/
    mod.rs          - config loading and merging
    allowlist.rs    - allowlist compilation
    discovery.rs    - hierarchical config file discovery
    merge.rs        - config merge logic
    rules.toml      - 109 built-in detection rules
  diff/
    mod.rs          - git diff retrieval
    parser.rs       - unified diff parser
  doctor/mod.rs     - diagnostic health checks
  hook/mod.rs       - git pre-commit hook installation
  output/
    mod.rs          - finding reports
    masking.rs      - secret value masking
  scanner/
    engine.rs       - main scanning pipeline
    rules.rs        - rule compilation
    entropy.rs      - shannon entropy calculation
    hash_detect.rs  - hash detection (sha-1, sha-256, md5)
    password.rs     - password strength heuristics
    pubkey.rs       - public key block detection and tracking

Scanning Pipeline (Core Engine)

The scanner processes files through a multi-stage pipeline optimized for speed and accuracy:

1. Git Diff Retrieval

git diff --cached --unified=0 --diff-filter=d

retrieves only staged changes
--unified=0 minimizes context (we only scan added lines)
--diff-filter=d excludes deleted files (no scanning needed)

2. Unified Diff Parsing

splits diff into DiffFile structs with metadata:
- path: file path
- is_new, is_deleted, is_renamed, is_binary: file status flags
- added_lines: vector of AddedLine structs with line_number and content
handles multi-file diffs, multiple hunks per file, and edge cases:
- binary files (via Binary files ... differ marker)
- renames (extracts final path from +++ b/ header)
- root commits (with --root flag)

3. .env File Blocking

immediate, unconditional block for .env, .env.local, .env.production
excludes .env.example, .env.template (safe examples)
prevents accidental exposure before scanning

4. Global Path Allowlist Check

pre-filters files to skip scanning:

binary extensions: .png, .jpg, .exe, .so, .woff2, etc.
vendor directories: node_modules/, vendor/, .venv/, etc.
generated files: package-lock.json, Cargo.lock, minified .min.js
documentation: README.md, docs/, *.rst (with entropy bonus)

5. Public Key Block Suppression

tracks multi-line PEM/PGP public key blocks via PubKeyBlockTracker
when detect_public_keys is disabled (default), lines inside public key blocks are skipped entirely
prevents base64 content in public keys from triggering token rules (e.g., EAA → facebook-access-token)
also detects single-line OpenSSH public keys (ssh-rsa AAAA..., etc.)
gated rules (pem-public-key, pgp-public-key-block, openssh-public-key) are skipped unless enabled

6. Aho-Corasick Keyword Pre-filter

single-pass scan across all rules’ keywords simultaneously
case-insensitive matching via aho-corasick automaton
builds a bitset of “candidate rules” whose keywords matched
drastically reduces regex evaluations (only ~2-5% of lines trigger regex)

Example: Line contains “akia” → activates aws-access-key-id rule

7. Regex Evaluation

only evaluates regexes for rules whose keywords matched
uses regex::bytes crate for binary-safe matching
extracts full matches or capture groups via secret_group field
handles multiple matches per line (iterates captures_iter)

Example: (AKIA[A-Z0-9]{16}) matches AKIAIOSFODNN7EXAMPLE

8. Secret Extraction

if secret_group > 0, extracts capture group value
if secret_group == 0, uses full match
skips empty matches

9. Per-Rule Allowlist Check

each rule can define:

value regexes: patterns to match against extracted secret (e.g., AKIAIOSFODNN7EXAMPLE)
path patterns: file path regexes to skip (e.g., test/.*)

10. Variable Reference Detection

skips values that are template variables, not real secrets:

$VAR, ${VAR}, %VAR% (shell variables)
process.env.VAR (node.js)
os.environ['VAR'] (python)
ENV['VAR'] (ruby)

11. Stopword Filtering

rules with entropy_threshold (tier 2+) check for common safe words:

built-in: test, example, fake, placeholder, changeme, dummy, mock
user-configurable via [allowlist] stopwords = [...]
tier 1 rules (no entropy threshold) only check placeholder patterns (XXXX..., ****...) to allow tokens like sk_test_ that inherently contain “test”

12. Hash Detection

prevents false positives on git commit hashes and checksums:

full-length hashes: 32 (md5), 40 (sha-1), 64 (sha-256) hex chars
abbreviated hashes: 7-12 hex chars
requires context keywords on the same line: commit, sha, hash, checksum, digest, integrity
uses word-boundary matching to avoid false matches (hash inside HashMap)

Example: sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 → skipped

13. Password Strength Heuristics

for generic-password-assignment and password-in-url rules only:

weak passwords allowed: password, admin, 123456, changeme
strong passwords blocked: complex passwords with high entropy + character classes
scoring:
- shannon entropy (0-5)
- character class bonus (uppercase + lowercase + digits + special ≥ 3 → +1.0, all 4 → +2.0)
- length bonus (≥12 chars → +0.5, ≥20 chars → +1.0)
- dictionary penalty (-4.0)
- threshold: 6.0

Rationale: password=test is a placeholder, password=Kj8#mP2!xQ9vL4nR is a real secret

14. Shannon Entropy Evaluation

for rules with entropy_threshold set:

calculates shannon entropy over all 256 byte values
min length: 20 characters (shorter strings skip entropy check)
documentation file bonus: +1.0 to threshold (raises bar for false positives in docs)
global override: --entropy-threshold or [settings] entropy_threshold = 3.5 sets a floor

Formula: H = -Σ(p_i * log2(p_i)) where p_i is frequency of byte i

Example: aaaaaaaaaaaaaaaaaaaaaaaa → entropy ≈ 0.0 (blocked), aB3dEf7hIj1kLmN0pQrStUvWxYz → entropy ≈ 4.2 (allowed)

15. Output with Masking

secret values always masked: AK**************FG (first 2 + last 2 chars)
short values (≤4 chars): fully masked with xxxx
prefixes: [ERROR] (scan), [AUDIT] (audit), [AGENT] (check-file)

Audit Pipeline

Working Tree Mode (`sekretbarilo audit`)

enumerate tracked files: git ls-files -z (nul-delimited for safe filenames)
optional: include ignored files: git ls-files -z --others --ignored --exclude-standard
apply exclude/include patterns: filter via regex (from [audit] section)
read files in parallel: rayon thread pool, converts to synthetic DiffFile structs
feed through scanner engine: same pipeline as scan mode
report findings: grouped by file, with masked values

Optimizations:

parallel file reads via rayon (4+ files)
binary detection: checks first 8KB for null bytes
error handling: read failures reported but don’t stop scanning

Git History Mode (`sekretbarilo audit --history`)

list commits: git rev-list --all --format=%H%n%an%n%ae%n%aI%n%at
- optional filters: --branch, --since, --until
- parses: hash, author, email, date, timestamp (unix)
identify root commits: git rev-list --max-parents=0 --all
- required for correct scanning (root commits need --root flag)
extract per-commit diffs: git diff-tree -p --no-commit-id -r --unified=0 --diff-filter=d -m <hash>
- -m splits merge commits into individual diffs per parent (catches secrets in conflict resolution)
- --root for root commits
parallel commit processing: rayon parallelizes commit scanning
- progress reporting every 50 commits
- error count tracked atomically
deduplication: same secret + same file + same rule = keep earliest commit by timestamp
- key: (file, rule_id, matched_value) → earliest HistoryFinding
- sorts by timestamp, then commit hash, then file, then line
branch resolution: git branch --contains <hash> --format=%(refname:short)
- only queries commits with findings (not all commits)
- parallel resolution via rayon
- best-effort: failures warned but don’t stop reporting
report findings: grouped by commit, shows author email + branches
- sanitizes output: strips control chars and bidi overrides (prevents terminal injection)

Check-File Pipeline (Agent Hooks)

Triggered by claude code when reading a file via the Read tool.

Input Modes

stdin json (--stdin-json): reads hook payload from stdin

{
  "tool_input": {"file_path": "/path/to/file.rs"},
  "cwd": "/project/root"
}

direct path: sekretbarilo check-file src/config.rs

Pipeline

parse stdin json payload (if --stdin-json):
- reads up to 1 MB from stdin (prevents unbounded memory)
- extracts file_path and optional cwd
resolve file path:
- absolute paths: computes relative path from cwd for better vendor/pattern detection
- relative paths: validates against path traversal (blocks ../../etc/passwd)
- returns (relative_path, base_dir)
validate base directory: checks base_dir.is_dir()
.env blocking: unconditional block for .env files (same policy as scan)
fast-path rejection (default allowlist only):
- uses hardcoded patterns (binary, vendor, lock files)
- skips config loading for obvious skips (performance optimization)
load hierarchical config:
- discovers configs from base_dir up to home directory
- merges all found configs
full fast-path rejection (with user config):
- applies user-defined allowlist paths
- applies audit exclude patterns
read file:
- converts to synthetic DiffFile (all lines treated as “added”)
- binary detection: first 8KB null byte check
- error handling: read errors block (fail closed)
compile scanner: builds CompiledScanner from merged rules
scan: runs through scanner engine (same pipeline as scan mode)

report findings (to stderr):

[AGENT] secret(s) detected in /path/to/file.rs
  line: 42
  rule: aws-access-key-id
  match: AK**************FG

exit code:
- 0 = clean (claude reads file)
- 2 = secrets found or error (claude blocks read)

Key Design: fail closed. errors (read failure, config error) exit 2 to prevent exposing secrets.

Configuration System

Hierarchical Discovery

searches for .sekretbarilo.toml in order (lowest → highest priority):

/etc/sekretbarilo/sekretbarilo.toml (system-wide)
~/.config/sekretbarilo/sekretbarilo.toml (user)
.sekretbarilo.toml in each directory from repo root → home

merge strategy:

scalars: highest priority wins (e.g., entropy_threshold)
lists: concatenated + deduplicated (e.g., stopwords, paths)
rules: same id overrides, new id appends

TOML Format

[settings]
entropy_threshold = 3.5

[allowlist]
paths = ["test/.*", "vendor/.*"]
stopwords = ["my-project-safe-token"]

[[allowlist.rules]]
id = "aws-access-key-id"
regexes = ["AKIAIOSFODNN7EXAMPLE"]
paths = ["test/.*"]

[audit]
include_ignored = true
exclude_patterns = ["^build/", "^dist/"]
include_patterns = ["\\.rs$"]

[[rules]]
id = "custom-token"
description = "Custom API token"
regex = "(CUSTOM_[A-Z]{10})"
secret_group = 1
keywords = ["custom_"]
entropy_threshold = 3.5

Rule Compilation

load default rules: embedded config/rules.toml (109 rules)
merge user rules: overrides by id, appends new rules
compile regexes: regex::bytes::RegexBuilder with 1 MB size limit
build aho-corasick automaton: all keywords (case-insensitive, deduplicated)
map keywords → rules: keyword_to_rules[pattern_idx] = [rule_idx, ...]

Result: CompiledScanner struct with automaton, keyword_to_rules, rules

Hook Installation

Pre-Commit Hook

local: .git/hooks/pre-commit

sekretbarilo install pre-commit

global: ~/.config/git/hooks/pre-commit + git config --global core.hooksPath

sekretbarilo install pre-commit --global

generated script:

#!/bin/sh
# sekretbarilo pre-commit hook
# resolve the sekretbarilo binary
SEKRETBARILO_BIN=""
if command -v sekretbarilo >/dev/null 2>&1; then
    SEKRETBARILO_BIN="sekretbarilo"
elif [ -x "$HOME/.cargo/bin/sekretbarilo" ]; then
    SEKRETBARILO_BIN="$HOME/.cargo/bin/sekretbarilo"
fi

if [ -n "$SEKRETBARILO_BIN" ]; then
    "$SEKRETBARILO_BIN" scan
    exit_code=$?
    if [ $exit_code -eq 1 ]; then
        exit 1
    elif [ $exit_code -ne 0 ]; then
        echo "[ERROR] sekretbarilo exited with code $exit_code" >&2
        exit $exit_code
    fi
else
    echo "[WARN] sekretbarilo not found in PATH or ~/.cargo/bin/, skipping secret scan" >&2
    echo "[WARN] install with: cargo install sekretbarilo" >&2
fi
# end sekretbarilo

features:

POSIX-compatible (no bashisms)
idempotent: detects existing installation via marker comment
appends to existing hooks (inserts before trailing exit)
graceful degradation: warns if binary not found but doesn’t fail
sets executable permission (chmod +x)

Agent Hook (Claude Code)

local: .claude/settings.json

sekretbarilo install agent-hook claude

global: ~/.claude/settings.json

sekretbarilo install agent-hook claude --global

settings.json structure:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Read",
        "hooks": [
          {
            "type": "command",
            "command": "sekretbarilo check-file --stdin-json",
            "timeout": 10,
            "statusMessage": "Scanning file for secrets..."
          }
        ]
      }
    ]
  }
}

features:

reads/creates .claude/settings.json or ~/.claude/settings.json
idempotent: detects existing entries, updates old command formats
preserves existing settings and other hook matchers
appends to existing Read matcher if present
atomic write: temp file + rename (prevents corruption)

Output & Masking

Masking Strategy

all secret values masked before display:

Length	Display
1-5 chars	`xxxxx` (fully masked)
6+ chars	`AB**********YZ` (first 2 + last 2)

rationale: prevents accidental exposure in logs/screenshots while allowing identification

Output Prefixes

[ERROR]: scan command findings (blocks commit)
[AUDIT]: audit command findings (informational)
[AGENT]: check-file command findings (blocks read)

Terminal Safety

history mode sanitizes all displayed fields:

strips control characters (\x00-\x1f, \x7f)
strips bidi overrides (\u{202A}-\u{202E})
prevents terminal injection via malicious git author/email/branch/file names

Design Decisions

Fail Closed

errors exit 2 (block) rather than 0 (allow):

config parse errors → block
file read errors → block
rule compilation errors → block

rationale: better to over-block than expose secrets

No Network

everything runs locally:

no telemetry
no remote rule updates
no API calls

rationale: privacy, security, offline support

POSIX Hooks

generated hooks use only POSIX shell features:

[ ] not [[ ]]
command -v not which
"$VAR" quoting everywhere

rationale: works on all shells (sh, bash, zsh, dash)

Graceful Degradation

hooks warn but don’t fail if binary not found:

[WARN] sekretbarilo not found in PATH or ~/.cargo/bin/, skipping secret scan

rationale: doesn’t break workflows if binary is temporarily missing

Stdin Limit

check-file limits stdin to 1 MB:

std::io::stdin().take(1_048_576).read_to_string(&mut input)

rationale: prevents unbounded memory consumption from malicious payloads

Regex Size Limit

all regexes compiled with 1 MB limit:

RegexBuilder::new(&pattern).size_limit(1 << 20).build()

rationale: prevents ReDoS (regular expression denial of service)

Parallel Processing

uses rayon for parallel file/commit processing:

threshold: 4+ files/commits
automatic work-stealing
respects available CPU cores

rationale: 10-100x speedup on large repos

Binary-Safe Scanning

uses regex::bytes and Vec<u8> everywhere:

handles non-UTF-8 files
no allocation for UTF-8 conversion

rationale: supports all file encodings

Performance Characteristics

Scan Mode (Staged Changes)

cold start: ~50-100ms (config load + rule compilation)
incremental: ~5-10ms per file
bottleneck: regex evaluation (mitigated by aho-corasick pre-filter)

Audit Mode (Working Tree)

small repos (< 1000 files): ~1-2 seconds
large repos (10,000+ files): ~10-30 seconds
bottleneck: file I/O (mitigated by parallel reads)

History Mode (All Commits)

small repos (< 100 commits): ~5-10 seconds
large repos (10,000+ commits): ~5-10 minutes
bottleneck: git diff-tree execution (mitigated by parallel processing)

Memory Usage

scan mode: ~5-10 MB
audit mode: ~50-100 MB (buffered file contents)
history mode: ~100-500 MB (buffered diffs + findings)

Optimization Techniques

aho-corasick pre-filter: reduces regex evaluations by 95%+
rayon parallelism: 10-100x speedup on multi-core
binary-safe bytes: no UTF-8 conversion overhead
reusable bitsets: avoids per-line allocations
early termination: skips binary/vendor/lock files immediately

Testing Strategy

Unit Tests

scanner engine: keyword matching, regex extraction, entropy calculation
diff parser: edge cases (binary, renames, root commits, multiple hunks)
config merging: scalar override, list concatenation, rule merging
allowlist compilation: path patterns, stopwords, per-rule allowlists

Integration Tests

default rules: all 109 rules compile and detect known secrets
hook installation: idempotency, appending, preservation of existing hooks
agent hook: JSON parsing, path resolution, fast-path rejection

Property-Based Testing

entropy calculation: monotonic increase with randomness
masking: never exposes full secret value
hash detection: SHA-1/SHA-256/MD5 lengths with context

Security Considerations

Threat Model

in-scope:

accidental secret commits (developer error)
copy-paste from documentation (example secrets)
weak passwords in config files

out-of-scope:

intentional malicious commits (insider threat)
encrypted/obfuscated secrets
secrets split across multiple lines

Attack Surface

malicious config files: TOML parsing uses serde_toml (memory-safe)
malicious git payloads: binary-safe parsing, control char stripping
ReDoS via user regexes: 1 MB size limit enforced
path traversal: canonicalization + prefix check in check-file mode
terminal injection: sanitizes author/email/branch/file names before display

Sandboxing

agent hook mode:

reads only the target file (no directory traversal)
no network access
no temp file creation
read-only access to config files

Future Work

Performance

incremental scanning (cache results per file hash)
rule priority ordering (check high-confidence rules first)
streaming diff parsing (avoid buffering full diffs)

Features

custom entropy models per rule (hex-only, base64-only)
machine learning-based false positive reduction
integration with secret management systems (vault, 1password)

Accuracy

context-aware scanning (understand variable assignments)
cross-file analysis (detect secrets split across imports)
semantic analysis (distinguish API keys from UUIDs)