The Problem That Wouldn’t Stop Bugging Me
If you’ve worked in any engineering role long enough, you’ve seen it happen: someone commits an AWS key to a repo. A .env file with production database credentials gets pushed. A Slack token shows up in a config file that was “only supposed to be local.”
I saw this constantly in my work as a performance test engineer. API keys scattered across JMeter scripts. Database connection strings hardcoded in test configs. Tokens embedded in CI/CD pipelines. And every time, the response was the same — somebody caught it (hopefully), rotated the credentials (hopefully quickly), and said “we should really scan for this stuff.”
So I built the thing.
What Is CredVigil?
CredVigil is an open-source secrets scanner written in Go. It scans codebases, config files, git history, and live file changes for exposed credentials — API keys, tokens, passwords, private keys, connection strings — across 75+ platforms and services.
But the part I’m most proud of isn’t the number of rules. It’s how it detects secrets.
Why Regex Alone Isn’t Enough
Most secrets scanners rely on regex patterns: if a string matches `AKIA[0-9A-Z]{16}`, it's probably an AWS access key. That works great for well-known formats. But what about:
- A custom internal API key that doesn’t match any known pattern?
- A base64-encoded token that looks like random noise?
- A password stored as `db_pass=Xk9$mP2!qR7@nL4`, with no standard format, pure entropy?
Regex misses all of these. That’s why CredVigil uses triple detection.
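To make the gap concrete, here's a minimal Go sketch (my own illustration, not CredVigil's actual code) showing how a classic AWS-key regex catches the well-known format but sails right past the entropy-only password above:

```go
package main

import (
	"fmt"
	"regexp"
)

// awsKeyPattern matches the classic AWS access key ID format:
// "AKIA" followed by 16 uppercase alphanumeric characters.
var awsKeyPattern = regexp.MustCompile(`AKIA[0-9A-Z]{16}`)

func main() {
	// AWS's documentation-only example key ID: matched.
	fmt.Println(awsKeyPattern.MatchString("aws_key=AKIAIOSFODNN7EXAMPLE")) // true

	// A high-entropy password with no recognizable format: missed.
	fmt.Println(awsKeyPattern.MatchString("db_pass=Xk9$mP2!qR7@nL4")) // false
}
```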
The Three Signals
| Method | What it does | What it catches |
|---|---|---|
| Regex (369 rules) | Pattern matching against compiled regexes for every known credential format | AWS, GCP, Azure, GitHub, Stripe, OpenAI, Slack, JWT, private keys, DB URIs, 75+ categories |
| Shannon Entropy | Measures information density of a string; high randomness = likely a secret | Custom API keys, random tokens, high-entropy strings with no known format |
| BPE Token Efficiency | Measures how poorly a string compresses under Byte Pair Encoding; secrets compress poorly | Passwords, opaque tokens, anything that looks random to a language model tokenizer |
These three signals combine into a confidence score from 0–100% per finding. Not a binary “this is a secret.” A score that lets you set thresholds and eliminate noise.
Zero-Trust by Design
Here’s something that always bothered me about other tools: they store the raw secret in their output. If your scanner’s report file leaks, you’ve now leaked the secrets again.
CredVigil never stores raw secrets. Every finding includes:
- A SHA-256 hash (for deduplication and tracking, not reversibility)
- A redacted preview, e.g. `wJal****EKEY`
- A fingerprint for cross-referencing across scans
- Severity, confidence score, file type, environment classification
The raw match exists only in memory during the scan. A five-stage post-processing pipeline runs on every finding before it reaches output; no raw secret leaves that sequence.
Each stage is a pure function that transforms a finding. Easy to test, easy to reason about, easy to extend. If the report file ever leaks, the secrets don't leak with it. That was a non-negotiable design goal.
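Here's a rough sketch of that pipeline shape in Go, covering three of the five stages. The `Finding` struct and stage bodies are my simplification (CredVigil's real fields and enrichment logic differ); the point is that each stage is a pure transformation and the raw secret is gone by the end:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Finding is a simplified stand-in for a scanner finding.
type Finding struct {
	Raw      string // present only during in-memory processing
	Hash     string
	Redacted string
}

// A stage is a pure function from finding to finding.
type stage func(Finding) Finding

// hashStage records a SHA-256 digest of the raw match for deduplication.
func hashStage(f Finding) Finding {
	sum := sha256.Sum256([]byte(f.Raw))
	f.Hash = hex.EncodeToString(sum[:])
	return f
}

// redactStage keeps only the first and last four characters of the match.
func redactStage(f Finding) Finding {
	if len(f.Raw) > 8 {
		f.Redacted = f.Raw[:4] + "****" + f.Raw[len(f.Raw)-4:]
	} else {
		f.Redacted = "****"
	}
	return f
}

// sanitizeStage drops the raw secret before the finding leaves memory.
func sanitizeStage(f Finding) Finding {
	f.Raw = ""
	return f
}

// run threads a finding through each stage in order.
func run(f Finding, stages ...stage) Finding {
	for _, s := range stages {
		f = s(f)
	}
	return f
}

func main() {
	f := run(Finding{Raw: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"},
		hashStage, redactStage, sanitizeStage)
	fmt.Println(f.Redacted, f.Raw == "") // wJal****EKEY true
}
```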
Architecture: Five Components, Each Independently Tested
| Component | What it does |
|---|---|
| Core Detection Engine | Regex + entropy + BPE scanning, concurrent file processing, confidence scoring |
| Secure Pipeline | Hash → redact → enrich → fingerprint → sanitize (5-stage post-processing) |
| Git Integration | Clone repos, walk commit history, diff branches, incremental scanning |
| File System Watcher | Real-time monitoring with fsnotify, debounced events, smart exclusions |
| Event Bus | Internal pub/sub for decoupled communication between components |
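The event bus is what keeps those components decoupled. A minimal synchronous version of the pattern (topic names and handler shapes here are illustrative, not CredVigil's API) looks like this:

```go
package main

import (
	"fmt"
	"sync"
)

// Bus is a minimal in-process pub/sub bus: subscribers register a
// handler per topic, and Publish fans each event out to all of them.
type Bus struct {
	mu   sync.RWMutex
	subs map[string][]func(event string)
}

func NewBus() *Bus {
	return &Bus{subs: make(map[string][]func(string))}
}

// Subscribe registers a handler for a topic.
func (b *Bus) Subscribe(topic string, h func(event string)) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subs[topic] = append(b.subs[topic], h)
}

// Publish delivers an event to every handler registered for the topic.
func (b *Bus) Publish(topic, event string) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, h := range b.subs[topic] {
		h(event)
	}
}

func main() {
	bus := NewBus()
	// A reporter can react to findings without importing the scanner.
	bus.Subscribe("finding", func(e string) { fmt.Println("report:", e) })
	bus.Publish("finding", "aws-key in config.yaml")
}
```

The mutex keeps concurrent subscribers and publishers race-free, which is what lets the whole system pass under `go test -race`.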
Each component has its own test suite. The whole system passes with Go’s race detector enabled. There are 14 end-to-end tests covering real-world scanning scenarios.
One thing I learned: building components in isolation and testing them separately made the whole system dramatically easier to debug. When something broke, I knew exactly which layer to look at.
Try It
Here’s what a real scan looks like — three runs, three different severity thresholds:
What I Learned Building This
- Regex is deep. Writing 369 detection rules that are precise enough to catch real secrets but not so greedy they flag every string that starts with `sk_` taught me more about regex than years of using them casually.
- Entropy isn't magic. Shannon entropy is a powerful signal, but it's not sufficient alone. Base64-encoded JSON has high entropy but isn't a secret. You need context — keyword proximity, file type, pattern structure — to make entropy useful. That's why it's one signal among several in the confidence score, not the whole answer.
- Concurrency in Go is a superpower. Scanning thousands of files needs to be fast. Go's goroutines and channels made it natural to process files concurrently while keeping the race detector happy. Getting to zero race conditions with `go test -race` was satisfying.
- The pipeline pattern is underrated. Hash → redact → enrich → fingerprint → sanitize. Each stage is a pure function that transforms a finding. Easy to test, easy to reason about, easy to extend.
- False positives are the real enemy. A scanner that fires on every string that looks vaguely like a secret is useless. Detecting placeholders (`EXAMPLE`, `changeme`, `TODO`), penalizing test fixtures, and computing confidence scores were just as important as the detection rules themselves.
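The concurrency point deserves a sketch. A worker-pool scan in Go looks roughly like this (the detector here is a stand-in substring count, not the real triple-detection engine, and the function names are mine):

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// scanFiles fans file contents out to a fixed pool of worker goroutines
// and collects per-file match counts over a channel, so thousands of
// files can be scanned in parallel without shared mutable state.
func scanFiles(files map[string]string, workers int) map[string]int {
	type result struct {
		name    string
		matches int
	}
	jobs := make(chan string)
	results := make(chan result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range jobs {
				// Stand-in detector: count a marker substring.
				results <- result{name, strings.Count(files[name], "AKIA")}
			}
		}()
	}
	// Close results once every worker has drained the job queue.
	go func() { wg.Wait(); close(results) }()
	// Feed file names to the pool, then signal no more work.
	go func() {
		for name := range files {
			jobs <- name
		}
		close(jobs)
	}()

	counts := make(map[string]int)
	for r := range results {
		counts[r.name] = r.matches
	}
	return counts
}

func main() {
	files := map[string]string{
		"a.cfg": "key=AKIAIOSFODNN7EXAMPLE",
		"b.txt": "nothing here",
	}
	fmt.Println(scanFiles(files, 4))
}
```

Because results flow back over a channel rather than into a shared map, this shape stays clean under the race detector by construction.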
What’s Next
CredVigil currently has five core components, fully built and tested. The roadmap includes:
- REST API server for CI/CD integrations
- Web dashboard for visual risk overview
- GitHub Actions pre-commit hook
- Slack / email / webhook notifications
- ML-based anomaly detection to catch secret patterns no regex rule could
View CredVigil on GitHub
Open source, Apache 2.0 licensed. PRs, issues, and feedback welcome.