Performance issue in IndexFlatIP #4121

Open

Museum7432 opened this issue Jan 9, 2025 · 5 comments

@Museum7432 commented Jan 9, 2025

Summary

IndexFlatIP doesn't seem to parallelize correctly: in a benchmark comparing the two, my custom flat index is 2x to 3x faster than IndexFlatIP (depending on the number of threads). I expected mine to be slower than Faiss, since it does the same thing under the hood with fewer optimizations, not faster like this.

Platform

OS: Linux

Faiss version: 1.9.0

Installed from: faiss-cpu from anaconda and pip

Running on:

  • [x] CPU
  • [ ] GPU

Interface:

  • [ ] C++
  • [x] Python

Reproduction instructions

Here is the colab notebook for reproducing the problem.
It could be caused by the memory access pattern in IndexFlatIP (for example, why does fvec_inner_products_by_idx parallelize over the queries rather than over the ids? I can't follow the code for dispatch_knn_ResultHandler, though, so I can't say for sure that it is the cause).
A possible explanation is that processing each query in a separate thread means every core has to stream the whole index through its cache on every search, which creates a memory-bandwidth bottleneck. My implementation instead processes all the queries against a segment of the index in each thread, so it needs far less memory bandwidth.
This issue becomes really noticeable on machines with high core counts.
Here is how IndexFlatIP scales with number of threads (ntotal=1000000, d=512, n_queries=100, topk=5):

n_threads   IndexFlatIP   custom
1           1.3 s         3.7 s
2           1.3 s         1.7 s
4           1.3 s         0.9 s
8           1.5 s         0.7 s
12          1.6 s         0.65 s
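
The timings above come from the linked Python notebook; purely to illustrate the shape of the benchmark, a roughly equivalent C++ timing loop (with random vectors standing in for the real data) would look something like this:

```cpp
#include <faiss/IndexFlat.h>

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <omp.h>
#include <random>
#include <vector>

int main() {
    const int64_t nb = 1000000, nq = 100, d = 512, k = 5;

    // random vectors as a stand-in for the real data
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(-1.f, 1.f);
    std::vector<float> xb(nb * d), xq(nq * d);
    for (auto& v : xb) v = dist(rng);
    for (auto& v : xq) v = dist(rng);

    faiss::IndexFlatIP index(d);
    index.add(nb, xb.data());

    std::vector<float> D(nq * k);
    std::vector<faiss::idx_t> I(nq * k);
    for (int nt : {1, 2, 4, 8, 12}) {
        omp_set_num_threads(nt); // limit the OpenMP threads faiss uses
        auto t0 = std::chrono::steady_clock::now();
        index.search(nq, xq.data(), k, D.data(), I.data());
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%2d threads: %.2f s\n",
                    nt, std::chrono::duration<double>(t1 - t0).count());
    }
    return 0;
}
```
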
@pankajsingh88 (Contributor) commented

Hi Museum7432! I don't have access to the colab link, so I can't check your setup.
As such do note that FAISS uses OpenMP internally for cpu parallelization and hence is limited by OpenMP's performance. We too have noticed in the past that for some indexes, workloads and hardware types, running multiple processes (or threads) of faiss and setting omp_threads to 1 produces higher overall throughput. Most likely you're witnessing the same. More on this is documented on the faiss wiki: https://github.com/facebookresearch/faiss/wiki/Threads-and-asynchronous-calls.

@Museum7432 (Author) commented Jan 15, 2025

Hi, @pankajsingh88. Weird, I can access that notebook normally even in a private tab.

> We too have noticed in the past that for some indexes, workloads and hardware types, running multiple processes (or threads) of faiss and setting omp_threads to 1 produces higher overall throughput.

You're right, the issue is mainly that Faiss does not scale well with the number of threads. After reading the faiss code, the cause seems to be that IndexFlatIP (and similar flat indexes) processes each query in a separate thread, which means every core has to read the whole index from RAM into its own cache (each core has its own cache on a multicore system); that does not scale with the number of threads.
My custom implementation is less optimized (slower with 1 or 2 threads), but it scales better because each thread processes a separate chunk of the index against all the queries (imagine all 64 threads each reading the full 16 GB vs. each thread reading only 256 MB).
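
Roughly, the search looks like this (a simplified sketch rather than the exact notebook code; the naive dot-product loop stands in for the BLAS call, and the chunk size is arbitrary):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Hit = std::pair<float, int64_t>; // (inner product, database id)

// Top-k inner-product search that parallelizes over database chunks:
// each chunk is scanned against ALL queries by one thread, so a thread
// only ever touches its own slice of the database.
std::vector<std::vector<Hit>> chunked_search(
        const float* xb, int64_t nb,  // database, nb x d, row-major
        const float* xq, int64_t nq,  // queries,  nq x d, row-major
        int64_t d, int64_t k) {
    const int64_t chunk = 65536; // database rows per task (arbitrary)
    const int64_t nchunks = (nb + chunk - 1) / chunk;

    // one candidate list per (chunk, query): disjoint, so no locking needed
    std::vector<std::vector<Hit>> cand(nchunks * nq);

#pragma omp parallel for schedule(dynamic)
    for (int64_t c = 0; c < nchunks; c++) {
        const int64_t begin = c * chunk, end = std::min(nb, begin + chunk);
        for (int64_t q = 0; q < nq; q++) {
            auto& cq = cand[c * nq + q];
            for (int64_t i = begin; i < end; i++) {
                float ip = 0.f; // a BLAS call would replace this inner loop
                for (int64_t j = 0; j < d; j++)
                    ip += xq[q * d + j] * xb[i * d + j];
                cq.emplace_back(ip, i);
            }
            // keep only this chunk's local top-k
            if ((int64_t)cq.size() > k) {
                std::partial_sort(cq.begin(), cq.begin() + k, cq.end(),
                                  std::greater<Hit>());
                cq.resize(k);
            }
        }
    }

    // merge the per-chunk candidates into the global top-k for each query
    std::vector<std::vector<Hit>> res(nq);
    for (int64_t q = 0; q < nq; q++) {
        for (int64_t c = 0; c < nchunks; c++) {
            auto& cq = cand[c * nq + q];
            res[q].insert(res[q].end(), cq.begin(), cq.end());
        }
        const int64_t kk = std::min<int64_t>(k, res[q].size());
        std::partial_sort(res[q].begin(), res[q].begin() + kk, res[q].end(),
                          std::greater<Hit>());
        res[q].resize(kk);
    }
    return res;
}
```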

> As such do note that FAISS uses OpenMP internally for cpu parallelization and hence is limited by OpenMP's performance

I suppose it is the memory bandwidth bottleneck that slows faiss down on large indexes with more than 2 threads. I tested my implementation in the VBS competition (~3 million vectors/keyframes with 1024 dimensions), and the average latency improved roughly linearly with the number of threads: ~1.7 s for 50 queries with 8 threads. On the same dataset, faiss's latency stayed at a constant 7.5 s regardless of the number of threads. So I don't think Faiss is limited by OpenMP's performance, but rather by the memory bandwidth between the CPU and RAM; my implementation simply requires less bandwidth.

I also saw a similar performance degradation when I switched my implementation to processing each query in a separate thread, like Faiss does, so that is most likely the cause.

Here is my search function if you can't access the notebook.

@pankajsingh88 (Contributor) commented Jan 15, 2025

Thanks for the code pointer! Can you point me to the location in the faiss code where you see the queries being iterated over? While I follow the index-copy argument in theory, I am not able to locate it in FAISS.
Based on my reading of the code, the distance calculation happens here, where the ip blocks (ip_line) should be distributed to different cores.

@Museum7432 (Author) commented Jan 21, 2025

Sorry for the late reply. As it turned out, I was looking at the fvec_inner_products_by_idx function, which does iterate over the queries.
The function actually used, exhaustive_L2sqr_blas_default_impl, calls cblas_sgemm, which should already give a good memory access pattern. However, the problem seems to be caused by how the sgemm call is multi-threaded once the number of queries is >= 20.

Memory bandwidth usage of faiss with 19 queries (in GB/s); the number of threads used increases over time, and the first segment is the initialization of the index (the search starts at the 6th second):

[image: memory bandwidth plot, 19 queries]

The same benchmark but with 20 queries:

[image: memory bandwidth plot, 20 queries]

With 19 queries, faiss seems to use the whole memory bandwidth and scale with the number of threads. But weirdly, with 20 queries, it only uses a constant 5 GB/s.

For reference, here is the same benchmark with my custom implementation, which runs single-threaded cblas_sgemm inside an OpenMP loop (a rough sketch follows the table below):

[image: memory bandwidth plot, custom implementation]

n_threads   19 queries   20 queries   mine
1           2.2 s        0.5 s        0.5 s
2           1.6 s        0.5 s        0.32 s
4           1.2 s        0.5 s        0.26 s
8           0.7 s        0.6 s        0.24 s
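
The core of the idea is a loop like the following (again a simplified sketch, not the exact notebook code: it assumes the BLAS library is pinned to a single thread, e.g. OPENBLAS_NUM_THREADS=1 or MKL_NUM_THREADS=1, so that the OpenMP loop owns all the parallelism, and it only fills the score matrix; the per-query top-k selection comes afterwards):

```cpp
#include <algorithm>
#include <cblas.h>
#include <cstdint>

// Fill scores (nq x nb, row-major) with inner products, using one
// single-threaded sgemm call per database chunk; the chunks are distributed
// across OpenMP threads. Each chunk writes a disjoint block of columns of
// the score matrix, so there are no data races.
void chunked_ip_scores(
        const float* xb, int64_t nb,  // database, nb x d, row-major
        const float* xq, int64_t nq,  // queries,  nq x d, row-major
        int64_t d,
        float* scores) {
    const int64_t chunk = 65536;
    const int64_t nchunks = (nb + chunk - 1) / chunk;

#pragma omp parallel for schedule(dynamic)
    for (int64_t c = 0; c < nchunks; c++) {
        const int64_t begin = c * chunk;
        const int64_t bsz = std::min(chunk, nb - begin);
        // scores[:, begin:begin+bsz] = xq @ xb[begin:begin+bsz].T
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    (int)nq, (int)bsz, (int)d,
                    1.0f, xq, (int)d,
                    xb + begin * d, (int)d,
                    0.0f, scores + begin, (int)nb);
    }
}
```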


This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Jan 29, 2025