r/computervision • u/pablocael • 6d ago

perceptual hashes) with Multi-Index Hashing that beats Faiss's own MIH

I built PyNear, an open-source (MIT) nearest-neighbor library with a C++ core and a NumPy-only Python API. Posting here because its sweet spot is binary descriptors — ORB, BRIEF, AKAZE, perceptual hashes — i.e. feature matching, image/video dedup, and copy detection.

What does pynear have:

- exact VP-Trees for low-to-mid dimensions

- IVF-Flat for high-dim float embeddings

- Multi-Index Hashing (MIH) + IVF for Hamming search on binary descriptors

The MIH part is why I'm posting. It splits each d-bit descriptor into m sub-strings and hashes them; by the pigeonhole principle, every neighbor within Hamming radius r is guaranteed to be found by probing sub-strings within radius floor([r/m](r/m)) — so on wide descriptors you skip the full O(N) scan.

Benchmarks (SIFT1M + synthetic, 24 threads, reproducible script in the repo)

- 512-bit near-duplicate retrieval: pynear's MIH hits 100% recall ~40x faster than Faiss's brute-force IndexBinaryFlat; Faiss's own IndexBinaryMultiHash isn't competitive at that width.

- SIFT1M 128-bit, matched recall: pynear's MIH is 1.3–1.6x faster than Faiss's MIH.

- Caveat I want to be upfront about: on narrow 128-bit descriptors at high recall, an optimized brute-force POPCNT scan (Faiss IndexBinaryFlat, ~22k QPS exact) beats MIH.

MIH earns its keep on wide descriptors and small-radius / near-duplicate workloads — not as a universal brute-force replacement.

One gotcha I hit and documented: pynear links libgomp and faiss-cpu links libomp; loading both in one process serializes Faiss's parallel flat scan, so I benchmark it in a separate process to keep the comparison fair.

Install: pip install pynear (wheels for Linux/macOS/Windows, no compiler, NumPy-only)

Repo + benchmarks: https://github.com/pablocael/pynear

Background write-up: https://medium.com/@pablo.cael/the-shared-recipe-behind-search-images-shazam-and-rag-08fc93a276ac

Would genuinely appreciate feedback on the methodology — and I'm curious what binary-descriptor workloads people here run (dedup, retrieval, SLAM loop closure?).

Happy to add index types if there's a real need.

6 Upvotes

permalink
reddit

You are about to leave Redlib

Do you want to continue?

https://www.reddit.com/r/computervision/comments/1tp3rqx/pynear_knn_for_binary_descriptors/
No, go back! Yes, take me to Reddit

100% Upvoted

View all comments

u/blimpyway 6d ago

Cool. Any tests against pynndescent?

1

u/pablocael 6d ago

Not yet. The benchmarks could be extended though.

Showcase PyNear – KNN for binary descriptors (ORB/BRIEF/perceptual hashes) with Multi-Index Hashing that beats Faiss's own MIH

You are about to leave Redlib