Potential bug with 0.6.0 shared storage_context and vector store retrieval #1769

Closed
decent-engineer-decent-datascientist opened this issue May 1, 2023 · 8 comments

@decent-engineer-decent-datascientist

Hey friends!

There's a very good chance that I'm misunderstanding the philosophy with this new refactor, but if I'm not, there may be a bug with how the vector store retriever works.

My understanding is that the storage context is something we can pass to multiple indices to reuse nodes and vectors (coo coo, very nice). However, looking at the retrieval method for the vector stores, there's nothing preventing a node that belongs to a different index from being picked up as the closest match (https://github.com/jerryjliu/llama_index/blob/c5d8768f5d0e5789e977c474457b2634f452957e/gpt_index/indices/vector_store/retrievers.py#L73).

Should there be a node-level check to ensure that the retrieved nodes actually belong to the given index? From what I gather, there's a filter at the document level, but if a document was parsed differently for different indices, the resulting nodes would have different node ids while sharing the same document id.
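To make the setup concrete, here is a minimal sketch of the kind of sharing being described, assuming the 0.6.x API (GPTVectorStoreIndex, StorageContext.from_defaults); whether the KeyError actually fires depends on which node the shared store happens to return:

```python
# Minimal sketch (assumed 0.6.x API) of two indices sharing one storage context.
from llama_index import Document, GPTVectorStoreIndex, StorageContext

storage_context = StorageContext.from_defaults()  # shared docstore + simple vector store

doc = Document("some shared text", doc_id="doc-1")

# Both indices parse the same document, so the shared vector store holds
# node ids from index_a *and* index_b, all pointing back to doc-1.
index_a = GPTVectorStoreIndex.from_documents([doc], storage_context=storage_context)
index_b = GPTVectorStoreIndex.from_documents([doc], storage_context=storage_context)

# A query against index_a searches the shared store; nothing stops it from
# returning a node id that only exists in index_b's nodes_dict -> KeyError.
response = index_a.as_query_engine().query("what does doc-1 say?")
```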

This would lead to errors such as:

```
     78     raise ValueError(
     79         "Vector store query result should return at "
     80         "least one of nodes or ids."
     81     )
     82 assert isinstance(self._index.index_struct, IndexDict)
---> 83 node_ids = [
     84     self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
     85 ]
     86 nodes = self._docstore.get_nodes(node_ids)
     87 query_result.nodes = nodes

File ~/miniconda3/lib/python3.10/site-packages/llama_index/indices/vector_store/retrievers.py:84, in <listcomp>(.0)
     78     raise ValueError(
     79         "Vector store query result should return at "
     80         "least one of nodes or ids."
     81     )
     82 assert isinstance(self._index.index_struct, IndexDict)
     83 node_ids = [
---> 84     self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
     85 ]
     86 nodes = self._docstore.get_nodes(node_ids)
     87 query_result.nodes = nodes

KeyError: '9e513d49-4c87-49f1-ba2f-ef9003529145'
```
@decent-engineer-decent-datascientist

Adding to this: this impacts every shared storage context that uses simple vector stores.
I realized that the vector store retriever doesn't know how to pull the doc ids, and beyond that, the simple vector store doesn't even consider the doc_ids passed to the VectorStoreQuery object (https://github.com/jerryjliu/llama_index/blob/c5d8768f5d0e5789e977c474457b2634f452957e/gpt_index/vector_stores/simple.py#L113).
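For reference, this is the shape of the filter the store would need to apply; a purely illustrative sketch, not the library's code, restricting candidates to a set of allowed ids before the similarity lookup:

```python
# Illustrative only: restrict a simple store's candidates to an allowed id set
# before scoring, which is what honoring VectorStoreQuery.doc_ids would amount to.
from typing import Dict, List, Optional

def restrict_to_ids(
    embedding_dict: Dict[str, List[float]],  # node_id -> embedding, as the simple store keeps them
    allowed_ids: Optional[List[str]],        # e.g. the querying index's own node ids
) -> Dict[str, List[float]]:
    if allowed_ids is None:
        return embedding_dict
    allowed = set(allowed_ids)
    return {nid: emb for nid, emb in embedding_dict.items() if nid in allowed}
```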

@jerryjliu
Collaborator

Thanks for flagging, this is a great point.

Yeah I think at the moment we get around this by instantiating a "new" DocumentStore in StorageContext.from_defaults if persist_dir is None (so if you create multiple vector stores and don't specify persist_dir, then you can avoid this problem).

However, if you do specify persist_dir, then simple vector stores may hit this problem. Adding namespace support is a TODO for us. cc @Disiok
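In other words (a hedged sketch of the workaround, assuming the 0.6.x API), as long as each StorageContext.from_defaults() call is made without persist_dir, every index gets its own in-memory docstore and vector store, so the collision can't happen:

```python
# Sketch of the workaround: a separate in-memory storage context per index.
from llama_index import Document, GPTVectorStoreIndex, StorageContext

docs_a = [Document("text for index a")]
docs_b = [Document("text for index b")]

ctx_a = StorageContext.from_defaults()  # no persist_dir -> fresh docstore/vector store
ctx_b = StorageContext.from_defaults()  # no persist_dir -> fresh docstore/vector store

index_a = GPTVectorStoreIndex.from_documents(docs_a, storage_context=ctx_a)
index_b = GPTVectorStoreIndex.from_documents(docs_b, storage_context=ctx_b)
```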

@jerryjliu jerryjliu added the bug Something isn't working label May 2, 2023
@khiemledev

I'm facing the same problem and am looking forward to an update.

@decent-engineer-decent-datascientist
Author

decent-engineer-decent-datascientist commented May 5, 2023

@khiemledev This is not a fix, but you can build each sub-index and persist it on its own, then create the graph on the fly when needed (see the sketch below). Not an ideal solution, but it works for now.
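A hedged sketch of that workaround, assuming the 0.6.x ComposableGraph / load_index_from_storage API: persist each sub-index to its own directory, then rebuild the graph whenever it is needed:

```python
# Sketch of the workaround: one persist_dir per sub-index, graph composed on the fly.
from llama_index import GPTListIndex, StorageContext, load_index_from_storage
from llama_index.indices.composability import ComposableGraph

# Each sub-index was persisted earlier via index.storage_context.persist(persist_dir=...).
sub_index_dirs = ["./storage/index_a", "./storage/index_b"]
sub_indices = [
    load_index_from_storage(StorageContext.from_defaults(persist_dir=d))
    for d in sub_index_dirs
]

graph = ComposableGraph.from_indices(
    GPTListIndex,
    sub_indices,
    index_summaries=["summary of index a", "summary of index b"],
)
response = graph.as_query_engine().query("...")
```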

@khiemledev

In my case, each index is saved in a different persist_dir. I can query each index and it works just fine, but after I add_nodes to an index loaded from storage and then query again, I get the error.
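For context, a hedged sketch of that sequence assuming the 0.6.x API; index.insert() stands in here for the add_nodes step, since the point is just that the index is mutated after being loaded from storage:

```python
# Sketch (assumed 0.6.x API): load an index from its own persist_dir, add to it,
# then query again -- the step after which the KeyError is reported to appear.
from llama_index import Document, StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage/my_index")
index = load_index_from_storage(storage_context)

index.insert(Document("newly added text"))  # new nodes enter the shared docstore
index.storage_context.persist(persist_dir="./storage/my_index")

response = index.as_query_engine().query("what was added?")
```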

@decent-engineer-decent-datascientist

@Disiok @jerryjliu any chance this was fixed recently? Reading through the commits, I can't see anything specific to this.

@Disiok
Collaborator

Disiok commented Jun 11, 2023

There's an initial fix (more of a bandaid for now). We pass the list of node ids from the vector index to the retriever, so it knows which subset of nodes to consider.

    def as_retriever(self, **kwargs: Any) -> BaseRetriever:
        # NOTE: lazy import
        from llama_index.indices.vector_store.retrievers import VectorIndexRetriever

        return VectorIndexRetriever(
            self, doc_ids=list(self.index_struct.nodes_dict.values()), **kwargs
        )
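Presumably the other half of the bandaid is that the retriever forwards those ids as VectorStoreQuery.doc_ids, so a store that honors the field only matches against this index's own nodes. A minimal sketch of constructing such a query, with placeholder values rather than library code:

```python
# Placeholder sketch: the ids below stand in for index_struct.nodes_dict.values().
from llama_index.vector_stores.types import VectorStoreQuery

query = VectorStoreQuery(
    query_embedding=[0.1, 0.2, 0.3],     # placeholder embedding of the user query
    similarity_top_k=2,
    doc_ids=["node-id-1", "node-id-2"],  # restricts matching to this index's nodes
)
```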

@Disiok Disiok closed this as completed Jun 11, 2023
@larshenkelmann

(in case others are wondering as I was) the PR seems to be #3695
