I feel like a very efficient way would be to tokenize in parallel, and then collect all tokenized data and compute at the end. IIRC tokenization is usually the bottleneck, and I'd imagine the tokenized data is smaller than the original text, especially if you can filter out tokens that aren't relevant to the query.
This should be doable! We can tokenize, discard all tokens that aren't in the query, and then accumulate. We might also need to accumulate token counts for each row, but that should be small too. Maybe for a first pass, if we exceed some limit (e.g. 1GB) we log a warning and just keep going (eventually OOM). The warning could be something like...
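A minimal sketch of the idea described above (all names are hypothetical, not the actual implementation): tokenize each row, keep only tokens that appear in the query, track per-row token counts, and log a warning once if the accumulated state exceeds a byte budget.

```python
import logging
import sys
from collections import Counter

logger = logging.getLogger("fts")

MEM_LIMIT_BYTES = 1 << 30  # e.g. the 1GB soft limit mentioned above

def tokenize(text):
    # Stand-in for the real tokenizer.
    return text.lower().split()

def accumulate(rows, query_tokens):
    """Accumulate (row_id, query-token counts, total token count) per row."""
    query_tokens = set(query_tokens)
    kept = []
    approx_bytes = 0
    warned = False
    for row_id, text in rows:
        tokens = tokenize(text)
        # Discard everything that isn't in the query before accumulating.
        counts = Counter(t for t in tokens if t in query_tokens)
        kept.append((row_id, counts, len(tokens)))
        approx_bytes += sys.getsizeof(counts)
        if approx_bytes > MEM_LIMIT_BYTES and not warned:
            logger.warning(
                "accumulated tokenized data exceeds limit; continuing anyway"
            )
            warned = True
    return kept
```

The per-row total token count is kept because BM25 needs document lengths for normalization, even after irrelevant tokens are dropped.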
Both of Claude's suggestions are valid. I will revisit this weekend / Monday when I have time to add regression tests for these cases.
This adds various performance improvements to the flat FTS search. The most significant improvement is that it parallelizes the search.
This does have some impact on accuracy. To calculate BM25 we typically need to make two passes through the data: the first to count token frequencies and the second to compute token scores. The current implementation avoids this by using the "token frequency so far" when calculating the BM25 score. This is generally accurate when there is a lot of indexed data and a small amount of unindexed data, because the "token frequency so far" is bootstrapped by the frequencies from the index, so the effect of the unindexed frequencies is minimal.
However, the error can be more significant when there is no index, or when the unindexed data makes up a significant portion of the total. In that case the "token frequency so far" can be quite inaccurate for the first few documents.
Parallelizing the search makes this problem worse, since each thread calculates its own independent "token frequency so far", and each one takes longer to converge on an accurate estimate.
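A toy illustration of the divergence (not the actual implementation): score the same term with the standard BM25 idf using the exact, full-corpus document frequency versus a running "frequency so far". The early running estimates are far off; only the last one matches the exact value.

```python
import math

def idf(n_docs, doc_freq):
    # Standard BM25 idf term.
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

docs = [["cat"], ["cat"], ["dog"], ["dog"], ["dog"]]
term = "cat"

# Exact: a full first pass counts the document frequency before scoring.
df_exact = sum(term in d for d in docs)
exact = idf(len(docs), df_exact)

# "So far": score each document using only the counts seen up to that point,
# as the single-pass implementation does.
df_so_far = 0
running = []
for i, d in enumerate(docs, start=1):
    if term in d:
        df_so_far += 1
    running.append(idf(i, df_so_far))
```

With a few documents and no index to bootstrap from, `running[0]` is far below `exact`; as more data is seen the estimate converges, which is why the error matters most when unindexed data dominates.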
The most accurate approach would probably be to accumulate all data in memory, tokenize (in parallel), count token frequencies (back to serial), then calculate scores (in parallel). This does, however, run the risk of accumulating too much data.
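The three-phase approach above could be sketched like this (hypothetical names, simplified scoring that uses only the idf term):

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def tokenize(text):
    return text.lower().split()

def bm25_idf(n_docs, df):
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1)

def search(rows, query_tokens):
    with ThreadPoolExecutor() as pool:
        # Phase 1 (parallel): tokenize every row.
        tokenized = list(pool.map(tokenize, rows))

        # Phase 2 (serial): exact document frequencies over all the data.
        df = Counter()
        for tokens in tokenized:
            df.update(set(tokens) & set(query_tokens))
        n = len(tokenized)

        # Phase 3 (parallel): score each row with the final frequencies.
        def score(tokens):
            return sum(bm25_idf(n, df[t]) for t in query_tokens if t in tokens)

        return list(pool.map(score, tokenized))
```

Because phase 2 sees every document before any score is computed, every row is scored against the same exact frequencies, at the cost of holding all tokenized rows in memory.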
Another alternative could be to accumulate up to some limit (e.g. 100MB), calculate initial token frequencies from that prefix, and then parallelize the rest of the search using those initial frequencies. I'm open to suggestions. In the meantime we can probably proceed with this PR as-is.
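The bounded-bootstrap alternative might look like this (a sketch with hypothetical names and a byte budget standing in for real memory accounting): buffer rows up to the limit, derive initial frequencies from that prefix, then hand the remaining rows to the parallel path with those frequencies as the starting point instead of zero.

```python
from collections import Counter

BOOTSTRAP_LIMIT = 100 * 1024 * 1024  # e.g. the 100MB limit mentioned above

def tokenize(text):
    return text.lower().split()

def bootstrap_frequencies(row_iter, query_tokens, limit=BOOTSTRAP_LIMIT):
    """Consume rows until `limit` bytes; return (df, n_seen, remaining rows)."""
    query_tokens = set(query_tokens)
    df = Counter()
    buffered_bytes = 0
    n_seen = 0
    for text in row_iter:
        df.update(set(tokenize(text)) & query_tokens)
        n_seen += 1
        buffered_bytes += len(text)
        if buffered_bytes >= limit:
            break
    # The rows still left in row_iter can now be scored in parallel, seeding
    # each worker with df / n_seen rather than an empty "frequency so far".
    return df, n_seen, row_iter
```

This keeps memory bounded while giving every worker the same (approximate) baseline, so the per-thread divergence described above only applies to data beyond the bootstrap prefix.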
In addition to the parallelization, this PR makes various changes to the algorithm itself to avoid string copies. This cuts CPU time by about 5x on my system.