Avert cross species contamination in VEP cache dump #1575
Merged
ENSVAR-6087
Problem
The VEP cache generally includes pubmed and var_synonyms data from the database. For the past couple of releases, data has been erratically missing from these fields. Note that these are the only two fields that are queried from the database, dumped into files, and then read back from those dump files when creating the cache.
Cause
After adding some debug output (the contents of the dump files read in each job), we found the cause of the missing data. We generally keep the data loaded from the dump files in memory for a species between jobs. However, these objects somehow persist across jobs for different species, so one species ends up with data from another.
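The exact mechanism is not identified above, but one way this kind of leak can arise is sketched below in Python for illustration. All names here (DUMPS, DumpWorker, run_job) are hypothetical and not taken from the VEP code; the sketch assumes a worker object that is reused for several jobs and loads its in-memory copy only once.

```python
# Hypothetical stand-in for the dump files on disk:
# species -> {variant id: list of pubmed ids}
DUMPS = {
    "homo_sapiens": {"rs1": [111, 222]},
    "mus_musculus": {"rs9": [333]},
}


class DumpWorker:
    """Hypothetical worker that is reused for several jobs."""

    def __init__(self):
        self._pubmed = None  # in-memory copy kept between jobs

    def run_job(self, species):
        # The copy is loaded once and never checked against the species,
        # so a later job for another species silently reuses stale data.
        if self._pubmed is None:
            self._pubmed = DUMPS[species]
        return self._pubmed


worker = DumpWorker()
first = worker.run_job("homo_sapiens")
second = worker.run_job("mus_musculus")  # receives human data, not mouse data
```

In this sketch the second job sees no mouse entries at all, which would show up as missing data in the resulting cache.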
Solution
Simply do not keep these objects in memory between jobs.
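A minimal sketch of this approach, in Python for illustration, assuming each job reads the dump for its own species on every run. The names (DUMPS, load_pubmed, run_job) are hypothetical and do not come from the VEP code.

```python
# Hypothetical stand-in for the dump files on disk:
# species -> {variant id: list of pubmed ids}
DUMPS = {
    "homo_sapiens": {"rs1": [111, 222]},
    "mus_musculus": {"rs9": [333]},
}


def load_pubmed(species):
    """Read the dump for the given species; nothing is cached."""
    return dict(DUMPS[species])


def run_job(species):
    # The data lives only in a local variable, so it is discarded when
    # the job returns and cannot leak into a job for another species.
    pubmed = load_pubmed(species)
    return pubmed


human = run_job("homo_sapiens")
mouse = run_job("mus_musculus")
```

The trade-off is that the dump is read again in every job, in exchange for each job being guaranteed to see only its own species' data.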
Test
Tested with the fix and compared the results between different species. The results can be viewed in the JIRA ticket above.