
[Bug]: KeyError in retriever.retrieve when sent index does not include all nodes in document store #7684

Closed
c64er4ever opened this issue Sep 15, 2023 · 3 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@c64er4ever

Bug Description

Hi,
I have created multiple indexes, each covering different set of nodes. I am then using one of the indexes to retrieve nodes based on a query. However, it seems that the retriever is trying to process nodes that are out of the context of the specific index that I have provided, causing a KeyError exception in this line in llama_index/indices/vector_store/retrievers/retriever.py:
self._index.index_struct.nodes_dict[idx] for idx in query_result.ids

I noticed that it is possible to send a list of node_ids to retrieve in order to constrain it. I haven't tried it yet, but I expect it would work. However, I believe it would be better if the retrieve function itself verified that only nodes covered by the provided index are processed.
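The verification suggested above could look like the following sketch. This is illustrative only, not the actual llama_index source: `safe_lookup` and its arguments are hypothetical names standing in for the `nodes_dict` lookup in retriever.py.

```python
# Illustrative sketch (not the actual llama_index code): skip result ids
# that are missing from this index's nodes_dict instead of raising KeyError.
def safe_lookup(nodes_dict, result_ids):
    """Return only the nodes whose ids are known to this index."""
    return [nodes_dict[i] for i in result_ids if i in nodes_dict]

nodes_dict = {"id-1": "node-1", "id-2": "node-2"}
# "id-3" exists in the doc store but is not covered by this index:
print(safe_lookup(nodes_dict, ["id-1", "id-3"]))  # ['node-1']
```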

Thanks!
Guy

Version

0.8.5.post2

Steps to Reproduce

  1. Create doc store with X nodes (e.g. 100)
  2. Create index that covers only part of the nodes (e.g. 10)
  3. Use the created index to create a Retriever: retriever = VectorIndexRetriever(my_index)
  4. Call retrieve with a question: nodes = retriever.retrieve("This is my question")

If the query returns nodes that are excluded from the provided index, a KeyError exception occurs.
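The failure mode in the steps above can be reproduced in miniature with plain dicts (a stand-in for the failing line in retriever.py, not the library itself):

```python
# Minimal stand-in for the failing lookup: the vector store can return ids
# from the whole doc store, but nodes_dict only covers the indexed subset.
nodes_dict = {f"id-{i}": f"node-{i}" for i in range(10)}  # index covers 10 nodes
query_result_ids = ["id-5", "id-42"]  # "id-42" is outside the index

try:
    nodes = [nodes_dict[i] for i in query_result_ids]
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'id-42'
```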

Relevant Logs/Tracebacks

No response

@c64er4ever c64er4ever added bug Something isn't working triage Issue needs to be triaged/prioritized labels Sep 15, 2023
@dosubot

dosubot bot commented Sep 15, 2023

🤖

Hi Guy,

Thank you for bringing this issue to our attention. It seems like you've encountered a KeyError because the retriever is trying to process nodes that are not covered by the specific index you've provided. This is a known issue and a similar problem was addressed in a previous issue #1769.

The solution suggested in that issue was to pass the list of node ids from the vector index to the retriever, so it knows which subset of nodes to process. Here's the code snippet from that solution:

def as_retriever(self, **kwargs: Any) -> BaseRetriever:
    # NOTE: lazy import
    from llama_index.indices.vector_store.retrievers import VectorIndexRetriever

    return VectorIndexRetriever(
        self, doc_ids=list(self.index_struct.nodes_dict.values()), **kwargs
    )

This code ensures that the retriever only processes nodes that are included in the provided index, which should prevent the KeyError you're experiencing.

However, please note that this is more of a workaround than a permanent solution. The underlying issue is that the retrieve function does not verify that only nodes covered by the provided index are processed. We will consider addressing this in a future update.

As for your question about changes made in the LlamaIndex codebase after version 0.8.5.post2 that could potentially address this issue, I wasn't able to find any. However, I recommend updating to the latest version (v0.8.27) to benefit from the most recent bug fixes and improvements.

I hope this helps! Let me know if you have any other questions.

Best,
Dosu


@logan-markewich
Collaborator

The bot is right! If you aren't using index.as_retriever(...), then you need to pass in the doc_ids yourself.
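The advice above can be pictured with a toy model. The class and names below are illustrative, not the llama_index API: the point is that a hand-built retriever is told which doc ids its index covers, so lookups never leave that set.

```python
# Toy model (illustrative names, not the llama_index API): a retriever
# constructed by hand receives the ids its index covers up front.
class ToyRetriever:
    def __init__(self, nodes_dict, doc_ids=None):
        self.nodes_dict = nodes_dict
        # Default to every id in the index; callers can narrow this.
        self.doc_ids = set(doc_ids) if doc_ids is not None else set(nodes_dict)

    def retrieve(self, result_ids):
        # Only ids within doc_ids are looked up, so no KeyError.
        return [self.nodes_dict[i] for i in result_ids if i in self.doc_ids]

retriever = ToyRetriever({"id-1": "node-1", "id-2": "node-2"}, doc_ids=["id-1"])
print(retriever.retrieve(["id-1", "id-2"]))  # ['node-1']
```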

@afalamna

@logan-markewich I am facing this issue, and I added the code snippet provided by the bot in retriever.py, but it's still giving me this error. This is my code:

from llama_index.llms import OpenAI
from llama_index.embeddings import TextEmbeddingsInference
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, StorageContext, load_index_from_storage
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.node_parser import SimpleNodeParser
from llama_index.query_engine import RetrieverQueryEngine

documents = SimpleDirectoryReader(
    input_files=["uber_10q_june_2022.pdf", "uber_10q_march_2022.pdf"]
).load_data()

embed_model = TextEmbeddingsInference(
    model_name="BAAI/bge-large-en-v1.5",  # required for formatting inference text
    timeout=60,  # timeout in seconds
    embed_batch_size=10,  # batch size for embedding
)

llm = OpenAI(temperature=0, model="gpt-3.5-turbo", max_tokens=1024)
print('Processing......')
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
index = VectorStoreIndex.from_documents(documents=documents, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=3)
