Potential bug with 0.6.0 shared storage_context and vector store retrieval #1769

Closed
decent-engineer-decent-datascientist opened this issue May 1, 2023 · 8 comments

@decent-engineer-decent-datascientist

Hey friends!

There's a very good chance that I'm misunderstanding the philosophy with this new refactor, but if I'm not, there may be a bug with how the vector store retriever works.

My understanding is that the storage context is something we can pass to multiple indices to reuse nodes and vectors (coo coo, very nice). However, looking at the retrieval method for the vector stores, there's nothing preventing a node that belongs to a different index from being picked up as the closest match (https://github.com/jerryjliu/llama_index/blob/c5d8768f5d0e5789e977c474457b2634f452957e/gpt_index/indices/vector_store/retrievers.py#L73).

Should there be a node-level check to ensure that the retrieved nodes actually belong to the given index? From what I gather, there's a filter at the document level, but if a document was parsed differently for different indices, the resulting nodes would have different node ids while sharing the same document id.
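To make the setup concrete, here is a minimal sketch of the kind of sharing being described, assuming the 0.6.x API (GPTVectorStoreIndex, StorageContext.from_defaults); whether the KeyError actually fires depends on which node the shared store happens to return:

```python
# Minimal sketch (assumed 0.6.x API) of two indices sharing one storage context.
from llama_index import Document, GPTVectorStoreIndex, StorageContext

storage_context = StorageContext.from_defaults()  # shared docstore + simple vector store

doc = Document("some shared text", doc_id="doc-1")

# Both indices parse the same document, so the shared vector store holds
# node ids from index_a *and* index_b, all pointing back to doc-1.
index_a = GPTVectorStoreIndex.from_documents([doc], storage_context=storage_context)
index_b = GPTVectorStoreIndex.from_documents([doc], storage_context=storage_context)

# A query against index_a searches the shared store; nothing stops it from
# returning a node id that only exists in index_b's nodes_dict -> KeyError.
response = index_a.as_query_engine().query("what does doc-1 say?")
```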

This would lead to errors such as:

```
     78     raise ValueError(
     79         "Vector store query result should return at "
     80         "least one of nodes or ids."
     81     )
     82 assert isinstance(self._index.index_struct, IndexDict)
---> 83 node_ids = [
     84     self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
     85 ]
     86 nodes = self._docstore.get_nodes(node_ids)
     87 query_result.nodes = nodes

File ~/miniconda3/lib/python3.10/site-packages/llama_index/indices/vector_store/retrievers.py:84, in <listcomp>(.0)
     78     raise ValueError(
     79         "Vector store query result should return at "
     80         "least one of nodes or ids."
     81     )
     82 assert isinstance(self._index.index_struct, IndexDict)
     83 node_ids = [
---> 84     self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
     85 ]
     86 nodes = self._docstore.get_nodes(node_ids)
     87 query_result.nodes = nodes

KeyError: '9e513d49-4c87-49f1-ba2f-ef9003529145'
```
@decent-engineer-decent-datascientist

Adding to this: this impacts every shared storage context that uses simple vector stores.
I realized that the vector store retriever doesn't know how to pull the doc ids, and beyond that, the simple vector store doesn't even consider the doc_ids passed to the VectorStoreQuery object (https://github.com/jerryjliu/llama_index/blob/c5d8768f5d0e5789e977c474457b2634f452957e/gpt_index/vector_stores/simple.py#L113).
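For reference, this is the shape of the filter the store would need to apply; a purely illustrative sketch, not the library's code, restricting candidates to a set of allowed ids before the similarity lookup:

```python
# Illustrative only: restrict a simple store's candidates to an allowed id set
# before scoring, which is what honoring VectorStoreQuery.doc_ids would amount to.
from typing import Dict, List, Optional

def restrict_to_ids(
    embedding_dict: Dict[str, List[float]],  # node_id -> embedding, as the simple store keeps them
    allowed_ids: Optional[List[str]],        # e.g. the querying index's own node ids
) -> Dict[str, List[float]]:
    if allowed_ids is None:
        return embedding_dict
    allowed = set(allowed_ids)
    return {nid: emb for nid, emb in embedding_dict.items() if nid in allowed}
```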

@jerryjliu
Collaborator

Thanks for flagging, this is a great point.

Yeah I think at the moment we get around this by instantiating a "new" DocumentStore in StorageContext.from_defaults if persist_dir is None (so if you create multiple vector stores and don't specify persist_dir, then you can avoid this problem).

However, if you do specify persist_dir, then simple vector stores may hit this problem. Adding namespace support is a TODO for us. cc @Disiok
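In other words (a hedged sketch of the workaround, assuming the 0.6.x API), as long as each StorageContext.from_defaults() call is made without persist_dir, every index gets its own in-memory docstore and vector store, so the collision can't happen:

```python
# Sketch of the workaround: a separate in-memory storage context per index.
from llama_index import Document, GPTVectorStoreIndex, StorageContext

docs_a = [Document("text for index a")]
docs_b = [Document("text for index b")]

ctx_a = StorageContext.from_defaults()  # no persist_dir -> fresh docstore/vector store
ctx_b = StorageContext.from_defaults()  # no persist_dir -> fresh docstore/vector store

index_a = GPTVectorStoreIndex.from_documents(docs_a, storage_context=ctx_a)
index_b = GPTVectorStoreIndex.from_documents(docs_b, storage_context=ctx_b)
```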

@jerryjliu jerryjliu added the bug Something isn't working label May 2, 2023
@khiemledev

I'm facing the same problem and am looking forward to an update.

@decent-engineer-decent-datascientist
Author

decent-engineer-decent-datascientist commented May 5, 2023

@khiemledev This is not a fix, but you can build each sub-index and persist it on its own, then create the graph on the fly when needed (see the sketch below). Not an ideal solution, but it works for now.
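A hedged sketch of that workaround, assuming the 0.6.x ComposableGraph / load_index_from_storage API: persist each sub-index to its own directory, then rebuild the graph whenever it is needed:

```python
# Sketch of the workaround: one persist_dir per sub-index, graph composed on the fly.
from llama_index import GPTListIndex, StorageContext, load_index_from_storage
from llama_index.indices.composability import ComposableGraph

# Each sub-index was persisted earlier via index.storage_context.persist(persist_dir=...).
sub_index_dirs = ["./storage/index_a", "./storage/index_b"]
sub_indices = [
    load_index_from_storage(StorageContext.from_defaults(persist_dir=d))
    for d in sub_index_dirs
]

graph = ComposableGraph.from_indices(
    GPTListIndex,
    sub_indices,
    index_summaries=["summary of index a", "summary of index b"],
)
response = graph.as_query_engine().query("...")
```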

@khiemledev

In my case, each index is saved in a different persist_dir. I can query each index and it works just fine, but after I add_nodes to an index loaded from storage and then query again, I get the error.
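For context, a hedged sketch of that sequence assuming the 0.6.x API; index.insert() stands in here for the add_nodes step, since the point is just that the index is mutated after being loaded from storage:

```python
# Sketch (assumed 0.6.x API): load an index from its own persist_dir, add to it,
# then query again -- the step after which the KeyError is reported to appear.
from llama_index import Document, StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage/my_index")
index = load_index_from_storage(storage_context)

index.insert(Document("newly added text"))  # new nodes enter the shared docstore
index.storage_context.persist(persist_dir="./storage/my_index")

response = index.as_query_engine().query("what was added?")
```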

@decent-engineer-decent-datascientist

@Disiok @jerryjliu any chance this was fixed recently? Reading through the commits, I can't see anything specific to this.

@Disiok
Collaborator

Disiok commented Jun 11, 2023

There's an initial fix (more of a bandaid for now). We pass the list of node ids from the vector index to the retriever, so it knows which subset of nodes to consider.

    def as_retriever(self, **kwargs: Any) -> BaseRetriever:
        # NOTE: lazy import
        from llama_index.indices.vector_store.retrievers import VectorIndexRetriever

        return VectorIndexRetriever(
            self, doc_ids=list(self.index_struct.nodes_dict.values()), **kwargs
        )
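Presumably the other half of the bandaid is that the retriever forwards those ids as VectorStoreQuery.doc_ids, so a store that honors the field only matches against this index's own nodes. A minimal sketch of constructing such a query, with placeholder values rather than library code:

```python
# Placeholder sketch: the ids below stand in for index_struct.nodes_dict.values().
from llama_index.vector_stores.types import VectorStoreQuery

query = VectorStoreQuery(
    query_embedding=[0.1, 0.2, 0.3],     # placeholder embedding of the user query
    similarity_top_k=2,
    doc_ids=["node-id-1", "node-id-2"],  # restricts matching to this index's nodes
)
```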

@Disiok Disiok closed this as completed Jun 11, 2023
@larshenkelmann

(in case others are wondering as I was) the PR seems to be #3695
