r/computervision • u/pablocael • 6d ago
Showcase PyNear – KNN for binary descriptors (ORB/BRIEF/perceptual hashes) with Multi-Index Hashing that beats Faiss's own MIH
I built PyNear, an open-source (MIT) nearest-neighbor library with a C++ core and a NumPy-only Python API. Posting here because its sweet spot is binary descriptors — ORB, BRIEF, AKAZE, perceptual hashes — i.e. feature matching, image/video dedup, and copy detection.
What does pynear have:
- exact VP-Trees for low-to-mid dimensions
- IVF-Flat for high-dim float embeddings
- Multi-Index Hashing (MIH) + IVF for Hamming search on binary descriptors
The MIH part is why I'm posting. It splits each d-bit descriptor into m sub-strings and hashes them; by the pigeonhole principle, every neighbor within Hamming radius r is guaranteed to be found by probing sub-strings within radius floor([r/m](r/m)) — so on wide descriptors you skip the full O(N) scan.
Benchmarks (SIFT1M + synthetic, 24 threads, reproducible script in the repo)
- 512-bit near-duplicate retrieval: pynear's MIH hits 100% recall ~40x faster than Faiss's brute-force IndexBinaryFlat; Faiss's own IndexBinaryMultiHash isn't competitive at that width.
- SIFT1M 128-bit, matched recall: pynear's MIH is 1.3–1.6x faster than Faiss's MIH.
- Caveat I want to be upfront about: on narrow 128-bit descriptors at high recall, an optimized brute-force POPCNT scan (Faiss IndexBinaryFlat, ~22k QPS exact) beats MIH.
MIH earns its keep on wide descriptors and small-radius / near-duplicate workloads — not as a universal brute-force replacement.
One gotcha I hit and documented: pynear links libgomp and faiss-cpu links libomp; loading both in one process serializes Faiss's parallel flat scan, so I benchmark it in a separate process to keep the comparison fair.
Install: pip install pynear (wheels for Linux/macOS/Windows, no compiler, NumPy-only)
Repo + benchmarks: https://github.com/pablocael/pynear
Background write-up: https://medium.com/@pablo.cael/the-shared-recipe-behind-search-images-shazam-and-rag-08fc93a276ac
Would genuinely appreciate feedback on the methodology — and I'm curious what binary-descriptor workloads people here run (dedup, retrieval, SLAM loop closure?).
Happy to add index types if there's a real need.
2
u/blimpyway 6d ago
Cool. Any tests against pynndescent?