
Import function fails to handle partial JSON objects and large files in v6.1.0 #766

Closed
t-jones opened this issue Sep 21, 2024 · 2 comments


t-jones (Contributor) commented Sep 21, 2024

Describe the bug
In v6.1.0, if an import file is larger than 20000 bytes and therefore requires multiple range reads from S3, the JSON objects at the start of every chunk after the first are truncated and fail to parse, so an unknown number of qids are silently dropped when importing a QnA export. In addition, the new code reads at most 15 chunks of import data from S3; at roughly 20 KB per chunk, that is only about 300 KB, which is much too small. These values should be configurable, or the threshold needs to be much higher.

Replacing this code with the v6.0.1 version fixes the issue. This is a regression over previous behavior.
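The chunk-boundary truncation described above is the classic pitfall of splitting a newline-delimited file on fixed byte offsets. A minimal sketch of the usual fix (this is not the QnABot source; all names are illustrative): carry the trailing partial line of each chunk into the next read before parsing, so no object is ever split at a chunk boundary.

```javascript
// Sketch: parse newline-delimited JSON read in fixed-size chunks,
// carrying the trailing partial line of each chunk into the next one.
function* parseJsonLinesInChunks(data, chunkSize) {
  let carry = '';
  for (let offset = 0; offset < data.length; offset += chunkSize) {
    const chunk = carry + data.slice(offset, offset + chunkSize);
    const lines = chunk.split('\n');
    carry = lines.pop(); // last element may be a partial line; hold it back
    for (const line of lines) {
      if (line.trim()) yield JSON.parse(line);
    }
  }
  if (carry.trim()) yield JSON.parse(carry); // flush the final remainder
}
```

With this pattern, a chunk size smaller than a single serialized object still yields every object intact, because parsing only ever happens on complete lines.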

To Reproduce

  1. Create or procure a large export file. This was reproduced with a 1 MB file containing 195 qids.
  2. Import the file.
  3. Compare the number of successfully imported qids to the actual number in the file. In this case, only 54 of the 195 qids were imported.

Expected behavior
All qid should be imported.

Please complete the following information about the solution:

  • [x] Version: (SO0189) QnABot with admin and client websites - Version v6.1.0
  • [x] Region: us-west-2
  • [x] Was the solution modified from the version published on this repository? No
  • [x] Have you checked your service quotas for the services this solution uses? This issue is not caused by service quotas.
  • [x] Were there any errors in the CloudWatch Logs? Yes, there are several errors that look like the following:
2024-09-20T22:42:41.262Z	80513f94-33be-4d7a-ab5b-b879ddec27cf	INFO	Failed to Parse: Unexpected token u in JSON at position 0 undefined <partial json text from the import file>

The stack trace looks like:

2024-09-20T22:42:41.262Z	80513f94-33be-4d7a-ab5b-b879ddec27cf	INFO	SyntaxError: Unexpected token u in JSON at position 0
    at JSON.parse (<anonymous>)
    at processQuestionObjects (/var/task/index.js:242:28)
    at exports.step [as handler] (/var/task/index.js:105:60)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
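For context on the message itself: "Unexpected token u ... at position 0" is the error JSON.parse raises (in older Node.js wording) when handed undefined, which it coerces to the string "undefined"; a chunk that begins mid-object fails with a similar position-0 SyntaxError. A quick illustration:

```javascript
// Illustration only: JSON.parse coerces undefined to the string
// "undefined", whose leading "u" is invalid JSON, producing a
// SyntaxError at position 0 (exact wording varies by Node version).
let fromUndefined = null;
try {
  JSON.parse(undefined);
} catch (e) {
  fromUndefined = e;
}

// A chunk truncated mid-object (e.g. the tail of a split JSON line)
// fails the same way.
let fromTruncated = null;
try {
  JSON.parse('":1}');
} catch (e) {
  fromTruncated = e;
}
```

Either cause is consistent with the log line above, which prints both "undefined" and the partial JSON text.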

Additional context
There is a log line like:

2024-09-20T22:42:41.438Z	80513f94-33be-4d7a-ab5b-b879ddec27cf	INFO	ContentRange:  bytes 20001-40001/840237

indicating that this file spans multiple reads. The first parse of each new chunk always fails. Furthermore, at 840237 bytes, less than half of this file is actually processed.
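A back-of-the-envelope check using the numbers above (the 20000-byte read size and 15-read cap come from this report, not from inspecting the source):

```javascript
// Rough arithmetic from the report: a ~20 KB read size capped at 15
// reads covers only ~300 KB, well under half of this 840237-byte file.
const chunkSize = 20000;  // approximate bytes per S3 range read
const maxChunks = 15;     // reported cap on reads
const fileSize = 840237;  // total size from the ContentRange log line
const maxBytesRead = chunkSize * maxChunks;        // 300000
const fractionProcessed = maxBytesRead / fileSize; // ≈ 0.357
```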

A couple of other points:

  • It would be great to remove all the Embeddings disabled - EMBEDDINGS_ENABLE: false log lines coming from embeddings.js. Perhaps they should be logged at debug level only?
fhoueto-amz (Member) commented

Thanks, we will look into this and come back.

tmekari (Contributor) commented Sep 27, 2024

This has been addressed in our v6.1.1 release. Thanks for bringing this to our attention, please feel free to reach out if you have any other issues!

tmekari closed this as completed Sep 27, 2024