
Import function fails to handle partial JSON objects and large files in v6.1.0 #766

Closed
t-jones opened this issue Sep 21, 2024 · 2 comments


t-jones (Contributor) commented Sep 21, 2024

Describe the bug
In v6.1.0, if an import file is larger than 20000 bytes and therefore requires multiple range reads from S3, the JSON objects at the start of every chunk after the first are truncated and fail to parse, so an unknown number of qids are silently dropped when importing a QnA export. In addition, the new code reads at most 15 chunks of import data from S3; at roughly 20 KB per chunk, that is only about 300 KB, which is much too small. These values should be configurable, or the threshold needs to be much higher.

Replacing this code with the v6.0.1 version fixes the issue. This is a regression over previous behavior.
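The chunk-boundary truncation described above is the classic pitfall of splitting a newline-delimited file on fixed byte offsets. A minimal sketch of the usual fix (this is not the QnABot source; all names are illustrative): carry the trailing partial line of each chunk into the next read before parsing, so no object is ever split at a chunk boundary.

```javascript
// Sketch: parse newline-delimited JSON read in fixed-size chunks,
// carrying the trailing partial line of each chunk into the next one.
function* parseJsonLinesInChunks(data, chunkSize) {
  let carry = '';
  for (let offset = 0; offset < data.length; offset += chunkSize) {
    const chunk = carry + data.slice(offset, offset + chunkSize);
    const lines = chunk.split('\n');
    carry = lines.pop(); // last element may be a partial line; hold it back
    for (const line of lines) {
      if (line.trim()) yield JSON.parse(line);
    }
  }
  if (carry.trim()) yield JSON.parse(carry); // flush the final remainder
}
```

With this pattern, a chunk size smaller than a single serialized object still yields every object intact, because parsing only ever happens on complete lines.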

To Reproduce

  1. Create or procure a large export file. This was reproduced with a 1 MB file containing 195 qids.
  2. Import the file.
  3. Compare the number of successfully imported qids to the actual number in the file. In this case, only 54 of the 195 qids were imported.

Expected behavior
All qid should be imported.

Please complete the following information about the solution:

  • [x] Version: (SO0189) QnABot with admin and client websites - Version v6.1.0
  • [x] Region: us-west-2
  • [x] Was the solution modified from the version published on this repository? No
  • [x] Have you checked your service quotas for the services this solution uses? This issue is not caused by service quotas.
  • [x] Were there any errors in the CloudWatch Logs? Yes, there are several errors that look like the following:
2024-09-20T22:42:41.262Z	80513f94-33be-4d7a-ab5b-b879ddec27cf	INFO	Failed to Parse: Unexpected token u in JSON at position 0 undefined <partial json text from the import file>

The stack trace looks like:

2024-09-20T22:42:41.262Z	80513f94-33be-4d7a-ab5b-b879ddec27cf	INFO	SyntaxError: Unexpected token u in JSON at position 0
    at JSON.parse (<anonymous>)
    at processQuestionObjects (/var/task/index.js:242:28)
    at exports.step [as handler] (/var/task/index.js:105:60)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
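For context on the message itself: "Unexpected token u ... at position 0" is the error JSON.parse raises (in older Node.js wording) when handed undefined, which it coerces to the string "undefined"; a chunk that begins mid-object fails with a similar position-0 SyntaxError. A quick illustration:

```javascript
// Illustration only: JSON.parse coerces undefined to the string
// "undefined", whose leading "u" is invalid JSON, producing a
// SyntaxError at position 0 (exact wording varies by Node version).
let fromUndefined = null;
try {
  JSON.parse(undefined);
} catch (e) {
  fromUndefined = e;
}

// A chunk truncated mid-object (e.g. the tail of a split JSON line)
// fails the same way.
let fromTruncated = null;
try {
  JSON.parse('":1}');
} catch (e) {
  fromTruncated = e;
}
```

Either cause is consistent with the log line above, which prints both "undefined" and the partial JSON text.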

Additional context
There is a log line like:

2024-09-20T22:42:41.438Z	80513f94-33be-4d7a-ab5b-b879ddec27cf	INFO	ContentRange:  bytes 20001-40001/840237

indicating that this file spans multiple reads. The first parse of each new chunk always fails. Furthermore, at 840237 bytes, less than half of this file is actually processed.
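A back-of-the-envelope check using the numbers above (the 20000-byte read size and 15-read cap come from this report, not from inspecting the source):

```javascript
// Rough arithmetic from the report: a ~20 KB read size capped at 15
// reads covers only ~300 KB, well under half of this 840237-byte file.
const chunkSize = 20000;  // approximate bytes per S3 range read
const maxChunks = 15;     // reported cap on reads
const fileSize = 840237;  // total size from the ContentRange log line
const maxBytesRead = chunkSize * maxChunks;        // 300000
const fractionProcessed = maxBytesRead / fileSize; // ≈ 0.357
```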

A couple of other points:

  • It would be great to remove all the Embeddings disabled - EMBEDDINGS_ENABLE: false log lines coming from embeddings.js. Perhaps they should be logged at debug level only?
fhoueto-amz (Member) commented

Thanks, we will look into this and come back.

tmekari (Contributor) commented Sep 27, 2024

This has been addressed in our v6.1.1 release. Thanks for bringing this to our attention, please feel free to reach out if you have any other issues!

tmekari closed this as completed Sep 27, 2024