Show HN: RAG-corpus-profiler – A linter for RAG datasets (dedup, PII, quality)

1 point, posted 11 hours ago
by aashirpersonal

1 comment

aashirpersonal

11 hours ago

Hi HN,

I’ve been building RAG systems for a while, and I noticed that 90% of retrieval failures aren't due to the LLM; they're due to the data. I got tired of debugging hallucinations only to find that the retriever had pulled "Page 1 of 5" headers or five duplicate versions of an old policy.

I couldn't find a simple "pandas-profiling" equivalent for unstructured text, so I built this.

It runs locally (CLI) and helps you:

- Detect semantic duplicates (using all-MiniLM-L6-v2) to save vector storage costs.

- Flag PII and secrets (emails, API keys) before they get indexed.

- Identify "coverage gaps" by comparing user queries against your docs.
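
To make the duplicate and coverage-gap checks concrete, here is a minimal sketch of the general idea, not the tool's actual code. It assumes the embeddings have already been computed (for example with all-MiniLM-L6-v2) and are passed in as plain lists of floats. The function names and both thresholds are hypothetical, chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_near_duplicates(embeddings, threshold=0.9):
    """Return (i, j, score) for each pair of chunks at or above the threshold.

    The 0.9 threshold is an assumption; tune it on your own corpus.
    """
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            score = cosine(embeddings[i], embeddings[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


def find_coverage_gaps(query_embeddings, doc_embeddings, threshold=0.5):
    """Return indices of queries whose best-matching chunk scores below threshold.

    The 0.5 threshold is likewise an assumption.
    """
    gaps = []
    for qi, q in enumerate(query_embeddings):
        best = max((cosine(q, d) for d in doc_embeddings), default=0.0)
        if best < threshold:
            gaps.append(qi)
    return gaps


# Toy 2-D vectors standing in for real embeddings. In practice they could come
# from sentence-transformers: SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
docs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
queries = [[1.0, 0.1], [-1.0, 0.0]]

duplicates = find_near_duplicates(docs)
gaps = find_coverage_gaps(queries, docs)
print(duplicates)  # chunks 0 and 1 are flagged as near-duplicates
print(gaps)        # query 1 has no sufficiently similar chunk
```

The pairwise loop is quadratic in the number of chunks, which is fine for a sketch but would likely need an approximate nearest-neighbor index on a large corpus.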

It outputs a standalone HTML report you can show to stakeholders.
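
For the PII check mentioned above, a regex scan is one plausible approach. The sketch below is an illustration under that assumption, not the profiler's real rule set: the pattern table and the `flag_pii` helper are hypothetical, and it only covers email addresses and AWS-style access key IDs.

```python
import re

# Hypothetical pattern table; a real scanner would need many more rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def flag_pii(text):
    """Return a list of (label, matched_text) for every pattern hit in text."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group(0)))
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
sample = "Contact ops@example.com, key AKIAIOSFODNN7EXAMPLE"
hits = flag_pii(sample)
print(hits)
```

Regexes like these trade recall for simplicity, so anything they flag is worth a manual look before a chunk is dropped or redacted.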

Written in Python, open source (MIT). Feedback welcome!

https://github.com/aashirpersonal/rag-corpus-profiler