-
Notifications
You must be signed in to change notification settings - Fork 6
MB-69881: Re-architect vector search #62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: bleve
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR re-architects vector search by replacing key-based list retrieval with bitmap-based vector counting for IVF indexes. It removes the obsolete faiss_SearchParametersIVF_new_with_sel and faiss_get_lists_for_keys APIs, replacing them with count_ivf_list_vectors which efficiently counts vectors per IVF list using bitmap iteration instead of key materialization.
Key Changes:
- New
count_ivf_list_vectorsmethod that iterates through IDSelectorBitmap to count vectors per IVF list - Removal of
get_lists_for_keysandfaiss_SearchParametersIVF_new_with_selAPIs - Type signature change for
faiss_Search_closest_eligible_centroidsfromFaissIndex*toconst FaissIndexIVF*
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| faiss/IndexIVF.h | Adds declaration and documentation for new count_ivf_list_vectors method; removes get_lists_for_keys declaration |
| faiss/IndexIVF.cpp | Implements count_ivf_list_vectors with bitmap iteration logic; removes get_lists_for_keys implementation |
| c_api/IndexIVF_c_ex.h | Updates C API: removes faiss_SearchParametersIVF_new_with_sel and faiss_get_lists_for_keys; adds faiss_count_ivf_list_vectors; updates faiss_Search_closest_eligible_centroids signature |
| c_api/IndexIVF_c_ex.cpp | Implements C API wrapper for count_ivf_list_vectors; removes implementations of deleted functions; updates faiss_Search_closest_eligible_centroids to use const cast |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| FAISS_THROW_MSG("count_ivf_list_vectors supports only IDSelectorBitmap"); | ||
| } | ||
| const uint8_t* bitmap = bitmap_sel->bitmap; | ||
| const size_t nbytes = bitmap_sel->n; |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a semantic mismatch between how the bitmap size is interpreted. According to IDSelectorBitmap's documentation and constructor (line 107 in IDSelector.h), the parameter 'n' represents "size of the bitmap array" (in bytes). However, on line 953, 'nbytes' is assigned from 'bitmap_sel->n', treating it as the number of bytes to iterate.
Looking at the is_member implementation in IDSelector.cpp line 120, it checks "if ((i >> 3) >= n)" which confirms that 'n' is indeed the number of bytes in the bitmap array. This means the implementation is correct.
However, there's potential confusion: if a caller constructs an IDSelectorBitmap with a bitmap sized for ntotal IDs (requiring ceil(ntotal/8) bytes), but accidentally passes a different value for n, this function would either under-iterate (missing valid IDs) or over-iterate (checking bits beyond the intended range). The safety check at line 966 mitigates over-iteration but doesn't detect under-iteration.
Consider adding an assertion or check that verifies: nbytes >= (ntotal + 7) / 8 to ensure the bitmap is large enough to cover all vectors in the index.
| const size_t nbytes = bitmap_sel->n; | |
| const size_t nbytes = bitmap_sel->n; | |
| const size_t min_bytes = (ntotal + 7) / 8; | |
| FAISS_ASSERT_MSG( | |
| nbytes >= min_bytes, | |
| "IDSelectorBitmap too small for index ntotal; bitmap size (bytes) " | |
| "must be >= ceil(ntotal/8)"); |
| * @param list_counts - Output array of size list_counts_size (must be == nlist). | ||
| * On return, list_counts[i] contains the number | ||
| * of selected vectors assigned to IVF list i. | ||
| * Must be initialized to zero by the caller. |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation comment for the 'list_counts' parameter states "Must be initialized to zero by the caller." However, the implementation uses the '++' operator (line 973 in IndexIVF.cpp) which increments the existing value. This means the assertion should verify that the array is zero-initialized, or the documentation should be updated to clarify that this is a requirement. Without verifying this precondition, if the caller passes an uninitialized or non-zero array, the results will be incorrect.
| * Must be initialized to zero by the caller. | |
| * Precondition: every element of list_counts must | |
| * be initialized to 0 by the caller before | |
| * calling this function; otherwise the resulting | |
| * per-list counts will be incorrect. |
| @param list_counts - Output array of size index->nlist. | ||
| On return, list_counts[i] contains the number | ||
| of selected vectors assigned to IVF list i. |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation states "Must be initialized to zero by the caller" but there is no corresponding note in the C API documentation. The C API documentation at lines 40-42 should also mention this requirement: "Output array of size index->nlist. Must be initialized to zero by the caller. On return, list_counts[i] contains the number of selected vectors assigned to IVF list i."
| @param list_counts - Output array of size index->nlist. | |
| On return, list_counts[i] contains the number | |
| of selected vectors assigned to IVF list i. | |
| @param list_counts - Output array of size index->nlist. Must be | |
| initialized to zero by the caller. On return, | |
| list_counts[i] contains the number of selected | |
| vectors assigned to IVF list i. |
| */ | ||
| int faiss_Search_closest_eligible_centroids( | ||
| FaissIndex* index, | ||
| const FaissIndexIVF* index, |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function signature changed the index parameter from FaissIndex* to const FaissIndexIVF*. This is a breaking API change that will cause compilation errors for existing code that passes a FaissIndex*. While the change makes sense (the function specifically needs an IVF index and doesn't modify it), this should be clearly documented as a breaking change in the PR description or release notes to warn API users.
| const FaissIndexIVF* index, | |
| const FaissIndex* index, |
| void IndexIVF::count_ivf_list_vectors( | ||
| idx_t* list_counts, | ||
| size_t list_counts_size, | ||
| const faiss::SearchParameters* params) const { | ||
| FAISS_ASSERT(list_counts != nullptr && list_counts_size > 0 && list_counts_size == nlist); | ||
| FAISS_ASSERT(params != nullptr && params->sel != nullptr); | ||
| FAISS_ASSERT(direct_map.type != DirectMap::NoMap); | ||
| const IDSelector* sel = params->sel; | ||
| // Optimized for bitmap selectors only | ||
| const IDSelectorBitmap* bitmap_sel = dynamic_cast<const IDSelectorBitmap*>(sel); | ||
| if (!bitmap_sel) { | ||
| FAISS_THROW_MSG("count_ivf_list_vectors supports only IDSelectorBitmap"); | ||
| } | ||
| const uint8_t* bitmap = bitmap_sel->bitmap; | ||
| const size_t nbytes = bitmap_sel->n; | ||
| // Iterate over bitmap bytes | ||
| for (size_t byte_idx = 0; byte_idx < nbytes; ++byte_idx) { | ||
| uint8_t byte = bitmap[byte_idx]; | ||
| if (byte == 0) { | ||
| continue; // fast skip | ||
| } | ||
| // Iterate over bits in the byte | ||
| for (uint8_t bit = 0; bit < 8; ++bit) { | ||
| if ((byte & (1 << bit)) == 0) { | ||
| continue; | ||
| } | ||
| idx_t id = (byte_idx << 3) + bit; | ||
| if (id >= ntotal) { | ||
| continue; // Safety check: skip invalid ids | ||
| } | ||
| uint64_t list_no = lo_listno(direct_map.get(id)); | ||
| if (list_no >= nlist) { | ||
| continue; // Safety check: skip invalid list numbers | ||
| } | ||
| list_counts[list_no]++; | ||
| } | ||
| } | ||
| } |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new function count_ivf_list_vectors lacks test coverage. Given that the repository has comprehensive testing (e.g., test_lowlevel_ivf.cpp, test_ivf_index.cpp), this new functionality should be tested to ensure it correctly counts vectors per IVF list when using IDSelectorBitmap, handles edge cases (empty bitmaps, invalid IDs, etc.), and validates the preconditions (direct_map set, proper bitmap size, etc.).
| idx_t* list_counts, | ||
| size_t list_counts_size, | ||
| const faiss::SearchParameters* params) const { | ||
| FAISS_ASSERT(list_counts != nullptr && list_counts_size > 0 && list_counts_size == nlist); |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion combines multiple conditions with logical AND, making it harder to debug when it fails. If any of these conditions fails, the user won't know which specific condition caused the assertion. Consider splitting this into separate assertions with descriptive messages:
FAISS_ASSERT(list_counts != nullptr);
FAISS_ASSERT(list_counts_size > 0);
FAISS_ASSERT(list_counts_size == nlist);
Or at minimum, use FAISS_THROW_IF_NOT_MSG for better error messages.
| FAISS_ASSERT(list_counts != nullptr && list_counts_size > 0 && list_counts_size == nlist); | |
| FAISS_ASSERT(list_counts != nullptr); | |
| FAISS_ASSERT(list_counts_size > 0); | |
| FAISS_ASSERT(list_counts_size == nlist); |
faiss_SearchParametersIVF_new_with_selAPI and consolidate eligible-vector handling through bitmap-based selectors.faiss_get_lists_for_keyswithfaiss_count_ivf_list_vectors, which efficiently iterates the eligible vector bitmap (IDSelectorBitmap) to count per-IVF-list vector membership, avoiding key materialization and enabling direct centroid eligibility computation in the storage layer.