
07. Network Access

kwrodarmer edited this page May 27, 2020 · 1 revision

Network Access

This page is likely to branch out to other pages dealing with network-related issues.

For today, I'm creating this page to capture a response to a user's question about operation without an Internet connection.

Is Internet Access Required?

When the SRA started, our main mode of locating database objects was to search the filesystem according to some rules and configuration files. Later on, we added remote file retrieval to the tools themselves for on-demand access. The lookup order was:

  1. look for objects locally under user's account
  2. look for objects on shared filesystems - the "site" repository
  3. look for objects remotely at NCBI - over the network
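
The lookup order above can be sketched roughly as follows. This is only an illustration, not the toolkit's actual resolver: the function name `resolve_object`, its parameters, and the idea that an object is stored in a file named exactly after its accession are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional


def resolve_object(
    accession: str,
    user_dirs: Iterable[Path],
    site_dirs: Iterable[Path],
    fetch_remote: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return a location for `accession`, trying user, site, then remote.

    Hypothetical sketch: the directory layout and file naming are assumed.
    """
    # Steps 1 and 2: search the user's directories first, then the
    # shared "site" repository directories.
    for directory in [*user_dirs, *site_dirs]:
        candidate = Path(directory) / accession
        if candidate.exists():
            return str(candidate)

    # Step 3: fall back to the network only when nothing was found on disk.
    if fetch_remote is not None:
        return fetch_remote(accession)
    return None
```

Passing `fetch_remote=None` models a purely offline lookup: the search stops after the two filesystem steps and returns `None` if the object is not on disk.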

Later on, some very large run submissions began to contain severe malformations, sometimes the result of bad aligners and sometimes the result of intentional manipulation. These disrupted our software's cache predictions and introduced heavy random access that increased processing times by an order of magnitude. Since we do not tell submitters what to put into their runs, we instead introduced a mechanism to cache the expensive seeks in a secondary object, identified by a ".vdbcache" extension. These objects could be added retroactively, so the primary object was not aware of the cache.

Because our default mechanism for locating files was to search the filesystem, there was no way to tell whether a vdbcache object was available other than by searching for it too. This meant that even if you had downloaded a run SRRxxxxxx and apparently had no further need for network access, the tools would probably still attempt a network connection when no vdbcache file could be found locally, since we would probe to see whether one existed remotely.
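
To make that behavior concrete, here is a minimal sketch of the probing logic. It is an assumption-laden illustration rather than the toolkit's code: `locate_vdbcache` and the `probe_remote` callback are hypothetical names, and the example assumes the cache file sits next to the run with ".vdbcache" appended to the run's file name.

```python
from pathlib import Path
from typing import Callable, Optional


def locate_vdbcache(
    run_path: Path,
    probe_remote: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Find the vdbcache for a run, probing the network if none is local.

    Hypothetical sketch: the sibling-file naming convention is assumed.
    """
    # The primary object does not record whether a cache exists,
    # so the only local check available is to look for the file.
    local_cache = run_path.with_name(run_path.name + ".vdbcache")
    if local_cache.exists():
        return str(local_cache)

    # No local cache: one may or may not exist remotely, and the only
    # way to find out is to ask. This is the unexpected network access.
    return probe_remote(run_path.name)
```

Note that the remote probe happens even when the run has no vdbcache at all, which is why a fully downloaded run could still trigger a network request.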

By now, we have so many objects in the SRA that it is inefficient to search multiple filesystems (especially over SMB), and we have expanded the meaning of a run accession so that it includes not only the ETL version but also the original and derived versions. We therefore attempt to contact the SRA metadata database to learn what we need to know about an SRA object while processing.

The SRA and the SRA Toolkit have evolved from simple CLI tools that operate in transparent ways to complex tools that interoperate with a backend. Our move to the cloud has only increased the complexity as we try to preserve as much backward compatibility as possible at the same time that we introduce new features.

As we go forward, we will try to smooth out the bumps that have been introduced, but we may not be able to address everyone's needs simultaneously.