Potential bug with 0.6.0 shared storage_context and vector store retrieval #1769
Comments
Adding to this. This impacts every shared storage context with simple vector stores.
Thanks for flagging, this is a great point. Yeah, I think at the moment we get around this by instantiating a "new" DocumentStore in […]. However, if you do specify persist_dir, then simple vector stores may have this problem. Adding namespace support is a TODO for us. cc @Disiok
I'm facing the same problem and am looking forward to an update.
@khiemledev This is not a fix, but you can build each sub-index and persist it on its own, then create the graph on the fly when needed. Not an ideal solution, but it works for now.
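In case it helps, here is a rough sketch of that workaround, assuming the 0.6.x API (`GPTVectorStoreIndex`, `load_index_from_storage`, `ComposableGraph.from_indices`); the directory paths and summaries are just placeholders:

```python
from llama_index import (
    GPTListIndex,
    GPTVectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
from llama_index.indices.composability import ComposableGraph

# Build and persist each sub-index in isolation, so the simple vector stores
# are never shared between indices (run once).
for name in ["docs_a", "docs_b"]:
    docs = SimpleDirectoryReader(f"./data/{name}").load_data()
    index = GPTVectorStoreIndex.from_documents(docs)
    index.storage_context.persist(persist_dir=f"./storage/{name}")

# Later: reload each sub-index from its own directory and compose the graph
# on the fly when it is needed.
indices = [
    load_index_from_storage(
        StorageContext.from_defaults(persist_dir=f"./storage/{name}")
    )
    for name in ["docs_a", "docs_b"]
]
graph = ComposableGraph.from_indices(
    GPTListIndex,
    indices,
    index_summaries=["Summary of docs_a.", "Summary of docs_b."],
)
response = graph.as_query_engine().query("What do these documents say about X?")
```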
In my case, each index is saved to a different persist_dir. I can query each index and it works just fine, but after I add_nodes to an index loaded from storage and query again, I get the error.
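If I read that right, the failing flow is roughly the following (a sketch assuming the 0.6.x API, with `index.insert` standing in for the node-adding step):

```python
from llama_index import Document, StorageContext, load_index_from_storage

# Reload an index that was previously persisted to its own directory.
storage_context = StorageContext.from_defaults(persist_dir="./storage/docs_a")
index = load_index_from_storage(storage_context)

# Querying the freshly loaded index works fine.
print(index.as_query_engine().query("First question"))

# Add new content to the loaded index...
index.insert(Document(text="Some newly added text."))

# ...and querying again is where the error reportedly shows up.
print(index.as_query_engine().query("Second question"))
```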
@Disiok @jerryjliu any chance this was fixed recently? Reading through the commits, I can't see anything specific to this.
There's an initial fix (more of a bandaid for now). We pass the list of node ids from the vector index to the retriever, so it knows which subset of nodes to consider:

```python
def as_retriever(self, **kwargs: Any) -> BaseRetriever:
    # NOTE: lazy import
    from llama_index.indices.vector_store.retrievers import VectorIndexRetriever

    return VectorIndexRetriever(
        self, doc_ids=list(self.index_struct.nodes_dict.values()), **kwargs
    )
```
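To illustrate the idea (this is not the actual library internals, just the filtering concept): once the retriever knows which node ids belong to the index, candidate matches from the shared vector store can be restricted to that set before picking the top results.

```python
from typing import Dict, List, Set, Tuple

def retrieve_top_k(
    query_scores: Dict[str, float],  # node_id -> similarity score from the vector store
    allowed_node_ids: Set[str],      # ids owned by this index, e.g. index_struct.nodes_dict values
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Keep only nodes that belong to the querying index, then take the best k."""
    owned = [(nid, score) for nid, score in query_scores.items() if nid in allowed_node_ids]
    return sorted(owned, key=lambda pair: pair[1], reverse=True)[:top_k]

# Example: the shared store returns matches that were created for two different
# indices; only the ids owned by "index A" survive the filter.
scores = {"node-a1": 0.91, "node-b7": 0.89, "node-a2": 0.55}
print(retrieve_top_k(scores, allowed_node_ids={"node-a1", "node-a2"}))
# [('node-a1', 0.91), ('node-a2', 0.55)]
```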
(in case others are wondering as I was) the PR seems to be #3695 |
Hey friends!
There's a very good chance that I'm misunderstanding the philosophy with this new refactor, but if I'm not, there may be a bug with how the vector store retriever works.
My understanding is that the storage context is something we can pass to multiple indices to reuse nodes and vectors (coo coo, very nice). However, looking at the retrieval method for the vector stores, there's nothing preventing a node that belongs to a different index from being picked up as the closest match (https://github.com/jerryjliu/llama_index/blob/c5d8768f5d0e5789e977c474457b2634f452957e/gpt_index/indices/vector_store/retrievers.py#L73)
Should there be a node-level check to ensure that the retrieved nodes actually belong to the given index? From what I gather, there's a filter at the document level, but if a document was parsed differently for different indices, the resulting nodes would have different node ids while sharing the same document id.
This would lead to errors such as:
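For anyone trying to reproduce the scenario above, the sharing pattern looks roughly like this (a sketch assuming the 0.6.x API; the key point is that both indices write their embeddings into the same simple vector store, so nothing stops a query on one index from surfacing nodes created for the other):

```python
from llama_index import (
    GPTVectorStoreIndex,
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
)

documents = SimpleDirectoryReader("./data").load_data()

# One storage context shared by both indices: same docstore, same simple vector store.
storage_context = StorageContext.from_defaults()

# The same documents parsed differently for each index, so the resulting nodes
# have different node ids but the same document ids.
index_a = GPTVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=ServiceContext.from_defaults(chunk_size_limit=512),
)
index_b = GPTVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=ServiceContext.from_defaults(chunk_size_limit=1024),
)

# Both sets of embeddings live in the shared vector store, so this query can
# come back with nodes that were created for index_b rather than index_a.
response = index_a.as_query_engine().query("What is this document about?")
```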