dbuckman
8 hours ago
Hi HN,
I’ve spent a lot of time working on data pipelines, and one of the most frustrating problems is accidentally syncing PII or developer secrets (like AWS keys or SSNs) into a data warehouse or downstream system.
Most of the enterprise tools that solve this are either massive Java applications, require complex Python environments, or cost $50k/year. I just wanted a lightning-fast, single binary I could drop into a CI/CD pipeline (--fail-on-pii) or run locally against a Postgres DB to see my exposure. So, I built pii-hound.
A few technical details on how it works under the hood:
Memory Efficiency: Scanning a 50GB CSV file shouldn't cause an OOM error. It uses a concurrent, streaming architecture and implements Reservoir Sampling so it can sample huge datasets sequentially while maintaining randomness and a tiny memory footprint.
Speed: For the keyword and column-name heuristics, I implemented Aho-Corasick string matching, which is significantly faster than running dozens of individual regexes against every header.
Accuracy: To cut down on false positives, things like Credit Card numbers don't just use regex; they are piped through a Luhn algorithm validation step.
Full transparency: I originally wrote the core of this scanning engine for a larger data management platform I’m building called Saddle Data. But I realized the scanner itself is incredibly useful as a standalone utility, so I extracted it, polished the CLI, and open-sourced it under the MIT license.
It currently supports Postgres, MySQL, Snowflake, BigQuery, SQLite, S3, GCS, and local files (CSV/JSON/Parquet).
I'd love for you to point it at a local database or a messy CSV and let me know how it performs. Happy to answer any questions about the Go implementation, and PRs for new regex rules or source connectors are very welcome!