Performance issue in IndexFlatIP #4121

Open

Museum7432 opened this issue Jan 9, 2025 · 5 comments

@Museum7432 commented Jan 9, 2025

Summary

IndexFlatIP doesn't seem to parallelize correctly: in a benchmark comparing the two, my custom flat index is 2x to 3x faster than IndexFlatIP (depending on the number of threads). I expected mine to be slower than Faiss, since it does the same thing under the hood with fewer optimizations, not faster like this.

Platform

OS: Linux

Faiss version: 1.9.0

Installed from: faiss-cpu from anaconda and pip

Running on:

  • [x] CPU
  • [ ] GPU

Interface:

  • [ ] C++
  • [x] Python

Reproduction instructions

Here is the colab notebook for reproducing the problem.
It could be caused by the memory access pattern in IndexFlatIP (for example, why does fvec_inner_products_by_idx parallelize over the queries rather than over the ids? I can't follow the code for dispatch_knn_ResultHandler, though, so I can't say for sure that it is the cause).
A possible explanation is that processing each query in a separate thread means every core has to stream the whole index through its cache on every search, which creates a memory-bandwidth bottleneck. My implementation instead processes all the queries against a segment of the index in each thread, so it needs far less memory bandwidth.
This issue becomes really noticeable on machines with high core counts.
Here is how IndexFlatIP scales with number of threads (ntotal=1000000, d=512, n_queries=100, topk=5):

n_threads   IndexFlatIP   custom
1           1.3 s         3.7 s
2           1.3 s         1.7 s
4           1.3 s         0.9 s
8           1.5 s         0.7 s
12          1.6 s         0.65 s
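
The timings above come from the linked Python notebook; purely to illustrate the shape of the benchmark, a roughly equivalent C++ timing loop (with random vectors standing in for the real data) would look something like this:

```cpp
#include <faiss/IndexFlat.h>

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <omp.h>
#include <random>
#include <vector>

int main() {
    const int64_t nb = 1000000, nq = 100, d = 512, k = 5;

    // random vectors as a stand-in for the real data
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(-1.f, 1.f);
    std::vector<float> xb(nb * d), xq(nq * d);
    for (auto& v : xb) v = dist(rng);
    for (auto& v : xq) v = dist(rng);

    faiss::IndexFlatIP index(d);
    index.add(nb, xb.data());

    std::vector<float> D(nq * k);
    std::vector<faiss::idx_t> I(nq * k);
    for (int nt : {1, 2, 4, 8, 12}) {
        omp_set_num_threads(nt); // limit the OpenMP threads faiss uses
        auto t0 = std::chrono::steady_clock::now();
        index.search(nq, xq.data(), k, D.data(), I.data());
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%2d threads: %.2f s\n",
                    nt, std::chrono::duration<double>(t1 - t0).count());
    }
    return 0;
}
```
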
@pankajsingh88 (Contributor) commented

Hi Museum7432! I don't have access to the colab link, so I can't check your setup.
As such do note that FAISS uses OpenMP internally for cpu parallelization and hence is limited by OpenMP's performance. We too have noticed in the past that for some indexes, workloads and hardware types, running multiple processes (or threads) of faiss and setting omp_threads to 1 produces higher overall throughput. Most likely you're witnessing the same. More on this is documented on the faiss wiki: https://github.com/facebookresearch/faiss/wiki/Threads-and-asynchronous-calls.

@Museum7432 (Author) commented Jan 15, 2025

Hi, @pankajsingh88. Weird, I can access that notebook normally even in a private tab.

> We too have noticed in the past that for some indexes, workloads and hardware types, running multiple processes (or threads) of faiss and setting omp_threads to 1 produces higher overall throughput.

You're right, the issue is mainly that Faiss does not scale well with the number of threads. After reading the faiss code, the cause seems to be that IndexFlatIP (and similar flat indexes) processes each query in a separate thread, which means every core has to read the whole index from RAM into its own cache (each core has its own cache on a multicore system); that does not scale with the number of threads.
My custom implementation is less optimized (slower with 1 or 2 threads), but it scales better because each thread processes a separate chunk of the index against all the queries (imagine all 64 threads each reading the full 16 GB vs. each thread reading only 256 MB).
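
Roughly, the search looks like this (a simplified sketch rather than the exact notebook code; the naive dot-product loop stands in for the BLAS call, and the chunk size is arbitrary):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Hit = std::pair<float, int64_t>; // (inner product, database id)

// Top-k inner-product search that parallelizes over database chunks:
// each chunk is scanned against ALL queries by one thread, so a thread
// only ever touches its own slice of the database.
std::vector<std::vector<Hit>> chunked_search(
        const float* xb, int64_t nb,  // database, nb x d, row-major
        const float* xq, int64_t nq,  // queries,  nq x d, row-major
        int64_t d, int64_t k) {
    const int64_t chunk = 65536; // database rows per task (arbitrary)
    const int64_t nchunks = (nb + chunk - 1) / chunk;

    // one candidate list per (chunk, query): disjoint, so no locking needed
    std::vector<std::vector<Hit>> cand(nchunks * nq);

#pragma omp parallel for schedule(dynamic)
    for (int64_t c = 0; c < nchunks; c++) {
        const int64_t begin = c * chunk, end = std::min(nb, begin + chunk);
        for (int64_t q = 0; q < nq; q++) {
            auto& cq = cand[c * nq + q];
            for (int64_t i = begin; i < end; i++) {
                float ip = 0.f; // a BLAS call would replace this inner loop
                for (int64_t j = 0; j < d; j++)
                    ip += xq[q * d + j] * xb[i * d + j];
                cq.emplace_back(ip, i);
            }
            // keep only this chunk's local top-k
            if ((int64_t)cq.size() > k) {
                std::partial_sort(cq.begin(), cq.begin() + k, cq.end(),
                                  std::greater<Hit>());
                cq.resize(k);
            }
        }
    }

    // merge the per-chunk candidates into the global top-k for each query
    std::vector<std::vector<Hit>> res(nq);
    for (int64_t q = 0; q < nq; q++) {
        for (int64_t c = 0; c < nchunks; c++) {
            auto& cq = cand[c * nq + q];
            res[q].insert(res[q].end(), cq.begin(), cq.end());
        }
        const int64_t kk = std::min<int64_t>(k, res[q].size());
        std::partial_sort(res[q].begin(), res[q].begin() + kk, res[q].end(),
                          std::greater<Hit>());
        res[q].resize(kk);
    }
    return res;
}
```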

> As such do note that FAISS uses OpenMP internally for cpu parallelization and hence is limited by OpenMP's performance

I suppose it is the memory bandwidth bottleneck that slows faiss down on large indexes with more than 2 threads. I tested my implementation in the VBS competition (~3 million vectors/keyframes with 1024 dimensions), and the average latency improved roughly linearly with the number of threads: ~1.7 s for 50 queries with 8 threads. On the same dataset, faiss's latency stayed at a constant 7.5 s regardless of the number of threads. So I don't think Faiss is limited by OpenMP's performance, but rather by the memory bandwidth between the CPU and RAM; my implementation simply requires less bandwidth.

I also saw a similar performance degradation when I switched my implementation to processing each query in a separate thread, like Faiss does, so that is most likely the cause.

Here is my search function if you can't access the notebook.

@pankajsingh88 (Contributor) commented Jan 15, 2025

Thanks for the code pointer! Can you point me to the location in the faiss code where you see the queries being iterated over? While I follow the index-copy argument in theory, I am not able to locate it in FAISS.
Based on my reading of the code, the distance calculation happens here, where the ip blocks (ip_line) should be distributed to different cores.

@Museum7432 (Author) commented Jan 21, 2025

Sorry for the late reply. As it turned out, I was looking at the fvec_inner_products_by_idx function, which does iterate over the queries.
The function actually used, exhaustive_L2sqr_blas_default_impl, calls cblas_sgemm, which should already give a good memory access pattern. However, the problem seems to be caused by how the sgemm call is multi-threaded once the number of queries is >= 20.

Memory bandwidth usage of faiss with 19 queries (in GB/s); the number of threads used increases over time, and the first segment is the initialization of the index (the search starts at the 6th second):

[image: memory bandwidth plot, 19 queries]

The same benchmark but with 20 queries:

[image: memory bandwidth plot, 20 queries]

With 19 queries, faiss seems to use the whole memory bandwidth and scale with the number of threads. But weirdly, with 20 queries, it only uses a constant 5 GB/s.

For reference, here is the same benchmark with my custom implementation, which runs single-threaded cblas_sgemm inside an OpenMP loop (a rough sketch follows the table below):

[image: memory bandwidth plot, custom implementation]

n_threads   19 queries   20 queries   mine
1           2.2 s        0.5 s        0.5 s
2           1.6 s        0.5 s        0.32 s
4           1.2 s        0.5 s        0.26 s
8           0.7 s        0.6 s        0.24 s
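
The core of the idea is a loop like the following (again a simplified sketch, not the exact notebook code: it assumes the BLAS library is pinned to a single thread, e.g. OPENBLAS_NUM_THREADS=1 or MKL_NUM_THREADS=1, so that the OpenMP loop owns all the parallelism, and it only fills the score matrix; the per-query top-k selection comes afterwards):

```cpp
#include <algorithm>
#include <cblas.h>
#include <cstdint>

// Fill scores (nq x nb, row-major) with inner products, using one
// single-threaded sgemm call per database chunk; the chunks are distributed
// across OpenMP threads. Each chunk writes a disjoint block of columns of
// the score matrix, so there are no data races.
void chunked_ip_scores(
        const float* xb, int64_t nb,  // database, nb x d, row-major
        const float* xq, int64_t nq,  // queries,  nq x d, row-major
        int64_t d,
        float* scores) {
    const int64_t chunk = 65536;
    const int64_t nchunks = (nb + chunk - 1) / chunk;

#pragma omp parallel for schedule(dynamic)
    for (int64_t c = 0; c < nchunks; c++) {
        const int64_t begin = c * chunk;
        const int64_t bsz = std::min(chunk, nb - begin);
        // scores[:, begin:begin+bsz] = xq @ xb[begin:begin+bsz].T
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    (int)nq, (int)bsz, (int)d,
                    1.0f, xq, (int)d,
                    xb + begin * d, (int)d,
                    0.0f, scores + begin, (int)nb);
    }
}
```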


This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Jan 29, 2025