This is an application for synchronizing files from different source repositories into a Dataverse installation. The synchronization runs in background processes. The same background processes also hash the Dataverse files when the source repository uses a different hash type than the Dataverse installation. These hashes are needed to compare the files, which makes versioning between dataset versions easier: only the files that have effectively changed are replaced and added to the overview of changes between dataset versions. The frontend application does not need to be running while the synchronization runs on the server (users can close their browsers once the synchronization has been set up). Multiple synchronizations for different users can run simultaneously, each on its own goroutine, scheduled as a "job" in the background. The number of simultaneously running jobs is adjustable, and the jobs are scheduled in "First In First Out" order.
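The scheduling described above can be illustrated with a minimal sketch. It is not the application's actual scheduler (which keeps its jobs in Redis): the Job type, the runJobs function, and the in-memory channel used as the queue are all assumptions made for this example. It only shows how a fixed, adjustable number of goroutines can drain a queue in first-in-first-out order.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a hypothetical stand-in for a scheduled synchronization or hashing job.
type Job struct {
	ID   int
	User string
}

// runJobs queues the jobs in a buffered channel (first in, first out) and
// lets `workers` goroutines drain it. It returns the job IDs in the order in
// which the workers picked them up.
func runJobs(jobs []Job, workers int, handle func(Job)) []int {
	queue := make(chan Job, len(jobs))
	for _, j := range jobs {
		queue <- j
	}
	close(queue) // no more jobs are added in this sketch

	var (
		mu     sync.Mutex
		picked []int
		wg     sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				// Dequeue and record under one lock, so that the recorded
				// order matches the dequeue order exactly.
				mu.Lock()
				j, ok := <-queue
				if ok {
					picked = append(picked, j.ID)
				}
				mu.Unlock()
				if !ok {
					return // queue is empty and closed
				}
				handle(j) // jobs themselves run concurrently, outside the lock
			}
		}()
	}
	wg.Wait()
	return picked
}

func main() {
	jobs := []Job{{1, "alice"}, {2, "bob"}, {3, "carol"}, {4, "dave"}}
	order := runJobs(jobs, 2, func(j Job) {
		fmt.Printf("synchronizing for %s (job %d)\n", j.User, j.ID)
	})
	fmt.Println("pick-up order:", order)
}
```

Changing the second argument of runJobs changes the number of jobs that can run at the same time, while the pick-up order stays the order in which the jobs were queued.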
This application can also be used as a stand-alone tool for uploading and synchronizing files from local storage to the Dataverse installation. Only the stand-alone tool allows synchronizing files from your local file system to a Dataverse installation.
Support for different repositories is implemented as plugins. More plugins will be added in the future. At this moment, the following plugins are provided with the latest version:
Download the binary built for your system (Windows, Linux or Darwin/macOS) from the latest release and execute it by double-clicking on it or by running it from the command line. By default, the application will connect to the Demo Dataverse. If you wish to connect to a different Dataverse installation, run it from the command line with the server parameter set to the Dataverse installation of your choice, e.g., on a Windows system:
demo_windows.exe -server=https://demo.dataverse.org
For more details, see the section about running the application.
For building the frontend, you need to have Angular CLI installed. You will need the latest Go installed for compiling the code. If you wish to build the application's container, you will need Docker installed. For running and building the application from source using the make commands, you will need make installed. Finally, the state of the application (calculated hashes, scheduled jobs, etc.) is maintained by a Redis data store. When running this application on a server, you will need access to either an external Redis server or one that you run locally. The stand-alone tool does not require a Redis server (or any other tool or library installed on your system), and can be run simply by executing the binary built for your operating system.
This application can be used by accessing the API directly (from cron jobs, etc.), or with a frontend providing a GUI for the end users. The frontend (rdm-integration-frontend) needs to be checked out separately prior to building this application. Besides the frontend dependency, the build process uses the following libraries and their dependencies (the go build command resolves them from the go.mod and go.sum files, and they do not need to be installed separately):
The most straightforward way of running the application is to use the binaries provided in the release. Note that these binaries are only meant to be used by the end users and should not be used on a server. If you wish to build your own binaries or build this application for running on a server, see the section on running and building from source.
By default, the tool connects to the Demo Dataverse. To change the default configuration, execute the binary with the -h argument, which lists the possible configuration options. For example, to connect to a different Dataverse installation, run it from the command line with the server and serverName parameters set to the Dataverse installation of your choice (serverName is a free-to-choose display name of the server as shown in the UI, e.g., My Dataverse). For instance, you can run the executable with the following options: -server=https://your.dataverse.installation -serverName="Dataverse Installation". On a Windows system, the full command looks like this (first change to the directory where the file was downloaded):
demo_windows.exe -server=https://your.dataverse.installation -serverName="Dataverse Installation"
You can also build your own binaries with different default values for the command-line arguments. See the next section for more detail (in the part about -X flags).
In order to run the application locally, check out this repository (rdm-integration) and the frontend repository (rdm-integration-frontend) in the same folder. Then go to the /rdm-integration directory (i.e., the directory where this repository is checked out) and run make run. This script will also start a docker container containing the Redis data store, which is used by the running application for storing the application state. Notice that if you do not run the standard Libis RDM (Dataverse) locally, you will need to configure the backend to connect to your Dataverse installation server. See the "Backend configuration" section for more details.
You can also use the make commands to build the docker image (make build -e BASE_HREF='/'). The resulting container can be used as a web server hosting the API and the frontend, or as a container running the workers executing the jobs requested by the frontend. For the purpose of scalability, both types of intended usage can have multiple containers running behind a load balancer. The default run command starts a container performing both tasks: a web server and a process controlling 100 workers:
docker run -v $PWD/conf:/conf --env-file ./env.demo -p 7788:7788 rdm/integration:1.0 app 100
After starting the docker container with the command above, verify that the web server is running by going to http://localhost:7788. If you wish to have a different number of simultaneously running workers, change 100 to the desired number. If you want the resulting container to function only as a web server, execute this command:
docker run -v $PWD/conf:/conf --env-file ./env.demo -p 7788:7788 rdm/integration:1.0 app
When running the web server separately from the workers, you will need at least one container running the workers, started with the following command:
docker run -v $PWD/conf:/conf --env-file ./env.demo -p 7788:7788 rdm/integration:1.0 workers 100
Binaries with the local file system plugin, just like the binaries included in the release (meant only for running by the end users and not on a server), are also built with a make command: make executable. You may want to adjust that script by setting the variables that make the application connect to your Dataverse installation. By default, the built application connects to the Demo Dataverse. In order to change that, adapt the build command in the following way (you can also run this command in the image directory, without the script):
go build -ldflags "-X main.DataverseServer=https://demo.dataverse.org -X main.RootDataverseId=demo -X main.DefaultHash=MD5" -v -o datasync.exe ./app/local/
These are the -X flags that you can use at build time to set the default values for the command-line arguments (as listed by the datasync.exe -h command):
- DataverseServer (sets the default value for the server argument): URL of the Dataverse installation that the built application will connect to by default.
- DataverseServerName (sets the default value for the serverName argument): display name of the Dataverse installation, e.g., "Demo Dataverse". This name is used only in the UI and is free to choose.
- RootDataverseId (sets the default value for the dvID argument): ID of the root Dataverse collection of the Dataverse installation, e.g., "demo".
- DefaultHash (sets the default value for the hash argument): most Dataverse installations use "MD5" as the hashing algorithm, so this flag can be omitted in most cases.
- MyDataRoleIds (sets the default value for the roleIDs argument): this application uses the retrieve "my data" API call. However, this API requires the Role ID (primary key of the role table where the particular role is stored in the database), which can be tricky to find. Only the datasets where the user has that particular role are returned by the server. If your Dataverse installation does not fill the dropdown for the dataset choice, then this flag should be adjusted. Otherwise, you can omit this flag. The default setting is 6,7, representing the contributor and curator roles on most installations.
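To make the relation between the -X flags and the command-line arguments more concrete, here is a minimal sketch of the general Go pattern. It is not the code of the stand-alone tool: the settings type and the parseArgs function are assumptions for this example, and only the variable names and argument names are taken from the list above. The -X linker flag can overwrite package-level string variables at build time, and those variables can then serve as defaults for the flag package.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Package-level string variables can be overwritten at build time, e.g.:
//   go build -ldflags "-X main.DataverseServer=https://your.dataverse.installation"
// The initial values below are the defaults mentioned in the text.
var (
	DataverseServer     = "https://demo.dataverse.org"
	DataverseServerName = "Demo Dataverse"
	RootDataverseId     = "demo"
	DefaultHash         = "MD5"
	MyDataRoleIds       = "6,7"
)

// settings is a hypothetical container for the parsed arguments.
type settings struct {
	Server, ServerName, DvID, Hash, RoleIDs string
}

// parseArgs uses the build-time values as defaults; arguments given at run
// time override them.
func parseArgs(args []string) (settings, error) {
	var s settings
	fs := flag.NewFlagSet("datasync", flag.ContinueOnError)
	fs.StringVar(&s.Server, "server", DataverseServer, "URL of the Dataverse installation")
	fs.StringVar(&s.ServerName, "serverName", DataverseServerName, "display name of the installation")
	fs.StringVar(&s.DvID, "dvID", RootDataverseId, "ID of the root Dataverse collection")
	fs.StringVar(&s.Hash, "hash", DefaultHash, "hashing algorithm used by the installation")
	fs.StringVar(&s.RoleIDs, "roleIDs", MyDataRoleIds, "role IDs used for the my data query")
	err := fs.Parse(args)
	return s, err
}

func main() {
	s, err := parseArgs(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("connecting to %s (%s)\n", s.Server, s.ServerName)
}
```

With this pattern, a binary built without any -X flags behaves like the demo build, and every default can still be changed per run from the command line.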
You can also build the binaries for multiple architectures at once with the make multiplatform_demo command. Adapt the build commands in that script in the same way as described for the make executable command.
The backend configuration is loaded by the application from the file whose path is stored in the BACKEND_CONFIG_FILE environment variable. To set a value for that variable, export it in the OS running the application, e.g.:
export BACKEND_CONFIG_FILE=../conf/backend_config.json
Note that the stand-alone version does not need the backend configuration file and is configured by the -X ldflags passed to the build command. You can also override these flags by adding arguments to the execution command, as described in the sections above.
An example of backend configuration can be found in backend-config.json. Another example, as can be used to connect to the Demo Dataverse, can be found in backend_config_demo.json. The BACKEND_CONFIG_FILE environment variable specifies which configuration file will be loaded. The only two mandatory fields in the configuration file are the following:
- dataverseServer: URL of the server where the Dataverse API is deployed.
- redisHost: the host containing the Redis data store (storing the application state).
Additionally, the configuration can contain the following fields in the optional "options" field:
- dataverseExternalUrl: this field is used to generate a link to the dataset presented to the user. Set this value if it is different from dataverseServer value, otherwise you can omit it.
- rootDataverseId: root Dataverse collection ID, needed for creating new dataset when no collection was chosen in the UI.
- defaultHash: as mentioned earlier, "MD5" is the default hash for most Dataverse installations. Change this only when your installation uses a different hashing algorithm (e.g., SHA-1).
- myDataRoleIds: role IDs for querying my data, as explained earlier in this section.
- pathToUnblockKey: path to the file containing the API unblock key. Configure this value to enable checking permissions before requesting jobs.
- pathToApiKey: path to the file containing the admin API key. Configure this value to enable URL signing instead of using the users' Dataverse API tokens.
- pathToRedisPassword: by default, no password is set. If you need to authenticate with Redis, store the path to the file containing the Redis password in this field.
- redisDB: by default, DB 0 is used. If you need to use another DB, specify it here.
- defaultDriver: default driver as used by the Dataverse installation, only "file" and "s3" are supported. See also the next section.
- pathToFilesDir: path to the folder where Dataverse files are stored (only needed when using the "file" driver).
- s3Config: configuration when using the "s3" driver, similar to the settings for the s3 driver in your Dataverse installation. Only needed when using S3 file system that is not mounted as a volume. See also the next section.
- pathToOauthSecrets: path to the file containing the OAuth client secrets and POST URLs for the plugins configured to use OAuth for authentication. An example of a secrets file can be found in example_oath_secrets.json. As shown in that example, each OAuth client has its own entry, identified by the application ID. Each entry contains two fields: clientSecret, containing the client secret, and postURL, containing the URL to which the POST request for acquiring tokens should be sent. See the frontend configuration section for information on configuring OAuth authorization for the plugins.
- maxFileSize: maximum size of a file that can be uploaded to the Dataverse installation. When not set, or set to 0 (or value less than 0), there is no limit on file size that can be uploaded. The files that cannot be uploaded due to the file size limit are filtered out by the frontend and the user is notified with a warning.
- userHeaderName: URL signing needs the username in order to know for which user to sign. The username should be passed in the header of the request. The default is "Ajp_uid", as sent by the Shibboleth IDP.
- smtpConfig: configure this when you wish to send notification emails to the users on job error and on job completion. For example, the configuration could look like this:
    "smtpConfig": {
        "host": "smtp.gmail.com",
        "port": "587",
        "from": "john.doe@gmail.com"
    },
    "pathToSmtpPassword": "/path/to/password/file"
- pathToSmtpPassword: path to the file containing the password needed to authenticate with the SMTP server
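As an illustration of the configuration described above, the following sketch loads a file from the path in BACKEND_CONFIG_FILE and checks the two mandatory fields. It is an assumption-laden example, not the application's loader: the Config and Options types, the parseConfig and loadConfig functions, and the JSON value types are hypothetical, and only a few of the optional fields are modeled.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// Options models a small, assumed subset of the optional "options" field.
type Options struct {
	DataverseExternalUrl string `json:"dataverseExternalUrl,omitempty"`
	RootDataverseId      string `json:"rootDataverseId,omitempty"`
	DefaultHash          string `json:"defaultHash,omitempty"`
	DefaultDriver        string `json:"defaultDriver,omitempty"`
	PathToFilesDir       string `json:"pathToFilesDir,omitempty"`
}

// Config holds the two mandatory fields and the optional options.
type Config struct {
	DataverseServer string  `json:"dataverseServer"`
	RedisHost       string  `json:"redisHost"`
	Options         Options `json:"options"`
}

// parseConfig decodes the JSON and enforces the mandatory fields.
func parseConfig(data []byte) (Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return c, err
	}
	if c.DataverseServer == "" || c.RedisHost == "" {
		return c, errors.New("dataverseServer and redisHost are mandatory")
	}
	if c.Options.DefaultHash == "" {
		c.Options.DefaultHash = "MD5" // default for most installations
	}
	return c, nil
}

// loadConfig reads the file named by the BACKEND_CONFIG_FILE variable.
func loadConfig() (Config, error) {
	path := os.Getenv("BACKEND_CONFIG_FILE")
	if path == "" {
		return Config{}, errors.New("BACKEND_CONFIG_FILE is not set")
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return Config{}, err
	}
	return parseConfig(data)
}

func main() {
	c, err := loadConfig()
	if err != nil {
		fmt.Fprintln(os.Stderr, "configuration error:", err)
		os.Exit(1)
	}
	fmt.Println("Dataverse:", c.DataverseServer, "Redis:", c.RedisHost)
}
```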
When running this tool on the server, you can take advantage of directly uploading files to the file system where Dataverse files are stored (assuming that you have direct access to that file system from the location where this application is running). The most generic way is simply mounting the file system as a volume and configuring the application (in the backend configuration file) to use the "file" driver pointing to the mounted volume. For example:
{
    "dataverseServer": "https://some.dataverse.com",
    "redisHost": "localhost:6379",
    "options": {
        "dataverseServer": "https://some.dataverse.com",
        "defaultDriver": "file",
        "pathToFilesDir": "/path/to/mounted/volume"
    }
}
As an alternative, you can access an s3 storage directly from this application, without the need of mounting it. First, you will need to configure the authentication by setting the following environment variables on the OS running this application:
- Access Key ID: AWS_ACCESS_KEY_ID or AWS_ACCESS_KEY
- Secret Access Key: AWS_SECRET_ACCESS_KEY or AWS_SECRET_KEY
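A small pre-flight check can confirm that one variable of each pair is set before the application is started. The sketch below is only an illustration: the firstSet helper is hypothetical, and the order in which the two names of each pair are checked is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// firstSet returns the name of the first variable in names for which lookup
// yields a non-empty value. The lookup function is a parameter so that the
// check does not depend on the real environment.
func firstSet(lookup func(string) string, names ...string) (string, bool) {
	for _, name := range names {
		if lookup(name) != "" {
			return name, true
		}
	}
	return "", false
}

func main() {
	id, okID := firstSet(os.Getenv, "AWS_ACCESS_KEY_ID", "AWS_ACCESS_KEY")
	secret, okSecret := firstSet(os.Getenv, "AWS_SECRET_ACCESS_KEY", "AWS_SECRET_KEY")
	if !okID || !okSecret {
		fmt.Fprintln(os.Stderr, "S3 credentials are not fully configured")
		os.Exit(1)
	}
	// Only the variable names are printed, never the secret values.
	fmt.Printf("using %s and %s\n", id, secret)
}
```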
The s3 driver is then configured in the backend configuration file, for example:
{
    "dataverseServer": "localhost:8080",
    "redisHost": "localhost:6379",
    "options": {
        "dataverseExternalUrl": "https://demo.dataverse.org",
        "defaultDriver": "s3",
        "s3Config": {
            "awsEndpoint": "http://some.endpoint.here",
            "awsRegion": "region",
            "awsPathstyle": "true",
            "awsBucket": "bucket"
        }
    }
}
Notice that the driver configuration is optional. When it is not set, no direct uploading is in use and simply the Dataverse API is called for storing the files. However, this can result in unnecessary usage of resources (network, CPU, etc.) and might slow down the Dataverse installation.
There are two types of possible customizations to the frontend. The first type is customization by replacing HTML files, e.g., footer.html and header.html. The replacement files are placed in the conf/customizations directory, which can also contain the files referenced by the custom HTML files. By default, only the make executable and make multiplatform_demo commands effectively replace these files while building. In order to add customizations to your make script, add the following line to the script: cp -r conf/customizations/* image/app/frontend/dist/datasync/.
The second type is the configuration with a configuration file. The default configuration file (used when the configuration file is not specified in the FRONTEND_CONFIG_FILE environment variable) can be found in default_frontend_config.json. In order to use a custom configuration file, set the FRONTEND_CONFIG_FILE environment variable accordingly. An example of the configuration file, also used by the make scripts and the docker commands, can be found in frontend_config.json.
The configuration file can contain the following options for the frontend:
- dataverseHeader: the display name of the Dataverse installation.
- collectionOptionsHidden: if set to false (or omitted), an extra dropdown is shown that allows for collection selection within the Dataverse installation. The selected collection is then used for creating a new dataset, when that option is enabled, and for filtering the available datasets.
- collectionFieldEditable: if set to true, the user can paste or type collection identifiers directly, without using the dropdown.
- createNewDatasetEnabled: if set to true, it enables the "Create new dataset" button.
- datasetFieldEditable: if set to true, the user can paste or type DOI identifiers directly, without using the dropdown.
- externalURL: this option is filled out by the backend from the dataverseExternalUrl backend configuration file field, and should not be set manually.
- showDvTokenGetter: set it to true to show the "Get token" button next to the Dataverse token field.
- showDvToken: set it to true to show the token field (set it to false when using URL signing).
- redirect_uri: when using OAuth, this option should be set to the redirect_uri as configured in the OAuth application setting (e.g., GitHub application settings as described in this guide). The redirect URI must point to the /connect page of this application.
- storeDvToken: set it to true to allow storing the Dataverse API token in the user's browser.
- sendMails: set it to true to enable sending mails to the user (you need to configure the SMTP settings in the backend configuration).
- plugins: contains one entry for each repository instance, as described below.
Having multiple instances for plugin types is useful when certain features, e.g., OAuth authentication, can be configured for specific installations of a given repository type. It is perfectly possible to have at most one instance for each plugin type, as is the case in default_frontend_config.json. Plugins that are not configured will not be shown in the UI. The repository instance, configured as an entry in the plugins setting of the frontend configuration, can contain the following fields:
- id: unique identifier for the repository instance configuration.
- name: name of the instance, as displayed in the "Repository instance" field on the connect page, e.g., "KU Leuven GitLab".
- plugin: the identifier of the plugin, as implemented in registry.go, e.g., irods, github, gitlab, etc.
- pluginName: display name of the plugin, as displayed in the "Repository type" dropdown.
- optionFieldName: when the plugin implements the Options function, this field is set to the name of the implemented option, e.g., "branch" or "folder".
- optionFieldPlaceholder: the placeholder for the option field.
- tokenFieldName: when the user needs to authenticate with an API token or password to the given repository (e.g., OAuth is not configured for this repository instance), this field should be set to the name of the needed credential, e.g., "Token" or "Password".
- tokenFieldPlaceholder: the placeholder for the token field.
- sourceUrlFieldName: when configured, the UI will show the source URL field, where the user can enter the URL of the repository to connect to.
- sourceUrlFieldPlaceholder: the placeholder for the source URL field.
- sourceUrlFieldValue: when configured, it contains the default value for the source URL field. When this value is always the same for a given plugin, e.g., https://github.com, then the sourceUrlFieldName can be left empty, and the field will not be shown (but will always contain the configured default value).
- usernameFieldName: when the user needs to authenticate with a username to the given repository (e.g., OAuth is not configured for this repository instance), this field should be set to the name of this field, e.g., "Username".
- usernameFieldPlaceholder: the placeholder for the username field.
- repoNameFieldName: repository selection field name.
- repoNameFieldPlaceholder: the placeholder for the repository selection field.
- repoNameFieldEditable: if set to true, the user can paste or type the repository name directly, without using the dropdown.
- repoNameFieldValues: suggested or possible repository names. When this is filled out, a dropdown will be presented to the user, otherwise a text field will be presented.
- repoNameFieldHasSearch: when the plugin implements the Search function, this field can be set to true for searchable repository names.
- repoNameFieldHasInit: when the plugin implements the Search function, this field can be set to true for an initial search without the search term (initializes the dropdown to the default values as implemented by the search function).
- parseSourceUrlField: when set to true, the repoName field can be left unconfigured and the repository name is parsed from the source URL field.
- tokenName: when set to a unique value, the credential needed for authentication is stored in the browser.
- tokenGetter: OAuth configuration for the repository instance, containing the URL where authorizations should be redirected to and the oauth_client_id from the OAuth application setting (e.g., GitHub application settings as described in this guide). See also the backend configuration section on how to configure the needed client secrets.
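Because each repository instance needs a unique id and several instances may share one plugin type, a configuration file can be sanity-checked before deployment. The sketch below is a hypothetical helper, not part of the application: the RepoPlugin and FrontendConfig types cover only a few of the fields listed above, the JSON value types are assumed, and the example entries (apart from the "KU Leuven GitLab" name taken from the text) are invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RepoPlugin models an assumed subset of one entry in the plugins setting.
type RepoPlugin struct {
	ID                  string `json:"id"`
	Name                string `json:"name"`
	Plugin              string `json:"plugin"`
	PluginName          string `json:"pluginName"`
	SourceUrlFieldValue string `json:"sourceUrlFieldValue,omitempty"`
	TokenFieldName      string `json:"tokenFieldName,omitempty"`
}

// FrontendConfig models an assumed subset of the frontend configuration.
type FrontendConfig struct {
	DataverseHeader string       `json:"dataverseHeader"`
	Plugins         []RepoPlugin `json:"plugins"`
}

// checkPlugins returns the number of configured instances per plugin type
// and fails when two instances share the same id.
func checkPlugins(data []byte) (map[string]int, error) {
	var cfg FrontendConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	seen := make(map[string]bool)
	perType := make(map[string]int)
	for _, p := range cfg.Plugins {
		if seen[p.ID] {
			return nil, fmt.Errorf("duplicate repository instance id %q", p.ID)
		}
		seen[p.ID] = true
		perType[p.Plugin]++
	}
	return perType, nil
}

// example is an invented configuration with two GitLab instances.
const example = `{
  "dataverseHeader": "Demo Dataverse",
  "plugins": [
    {"id": "gitlab-a", "name": "KU Leuven GitLab", "plugin": "gitlab", "pluginName": "GitLab"},
    {"id": "gitlab-b", "name": "Other GitLab", "plugin": "gitlab", "pluginName": "GitLab"},
    {"id": "github", "name": "GitHub", "plugin": "github", "pluginName": "GitHub",
     "sourceUrlFieldValue": "https://github.com", "tokenFieldName": "Token"}
  ]
}`

func main() {
	perType, err := checkPlugins([]byte(example))
	if err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	fmt.Println("instances per plugin type:", perType)
}
```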
In order to integrate a new repository type, you need to implement a new plugin for the backend. The plugins are implemented in the image/app/plugin/impl folder (each having its own package). The new plugin implementation must then be registered in the registry.go file. As can be seen in that file, a plugin implements the functions required by the Plugin type:
type Plugin struct {
	Query   func(ctx context.Context, req types.CompareRequest, dvNodes map[string]tree.Node) (map[string]tree.Node, error)
	Options func(ctx context.Context, params types.OptionsRequest) ([]string, error)
	Search  func(ctx context.Context, params types.OptionsRequest) ([]string, error)
	Streams func(ctx context.Context, in map[string]tree.Node, streamParams types.StreamParams) (map[string]types.Stream, error)
}
Each plugin implements at least these two functions:
- Query: using the standard fields as provided in the "types.CompareRequest" (username, API token, URL, etc.), this function queries the repository for files. The result is a flat mapping of files found on the repository to their paths. A file is represented by a "tree.Node" type containing the file name, file path, hash type and hash value, etc. Notice that it does not contain the file itself. The dvNodes parameter holds a copy of the nodes as present in the dataset on the Dataverse installation (and can be ignored in most cases).
- Streams: files are synchronized using streams from the source repository to the file system, where each file has its own stream. This function implements "types.Stream" objects for the provided files (the "in" parameter contains a filtered list of files that are going to be copied from the repository). Notably, a "types.Stream" object contains a function for opening a stream to the provided file and a function to close that stream.
Additionally, the plugins can implement the following functions:
- Options: this function lists branches (or folders in the case of IRODS) applicable for the current repository. It can only be called when the user has provided the credentials needed to call the repository (this is verified at the frontend) and the repository name that the options will apply to. These credentials and the repository name are then provided in the "types.OptionsRequest" value. This function only needs to be implemented when this functionality is needed by the given type of repository.
- Search: when implemented, this function can be used for searching repositories by name, based on the search term provided by the user. It makes repository selection easier for the users.
After implementing the above-mentioned functions on the backend, the plugin needs to be configured at the frontend. It then becomes selectable by the user, with the possibility of different configurations for specific repository instances. See the section on frontend configuration for further details.
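The two mandatory functions can be sketched with a toy plugin backed by an in-memory "repository". This is only an illustration under stated assumptions: CompareRequest, Node, StreamParams and Stream below are simplified local stand-ins for types.CompareRequest, tree.Node, types.StreamParams and types.Stream (the real types have other fields), the Open and Close field names are assumed, and repoFiles and copyFile exist only for this example.

```go
package main

import (
	"context"
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// Simplified, hypothetical stand-ins for the types used by the real Plugin type.
type CompareRequest struct{ Url, RepoName, Token string }
type StreamParams struct{ Url, RepoName, Token string }
type Node struct{ Name, Path, HashType, Hash string }
type Stream struct {
	Open  func() (io.Reader, error)
	Close func() error
}

// repoFiles plays the role of the source repository in this sketch.
var repoFiles = map[string]string{"README.md": "hello", "data/a.txt": "alpha"}

// Query returns a flat mapping from file path to node (metadata only, no content).
func Query(_ context.Context, _ CompareRequest, _ map[string]Node) (map[string]Node, error) {
	res := make(map[string]Node, len(repoFiles))
	for path, content := range repoFiles {
		name := path[strings.LastIndex(path, "/")+1:]
		sum := md5.Sum([]byte(content))
		res[path] = Node{Name: name, Path: strings.TrimSuffix(path, name),
			HashType: "MD5", Hash: hex.EncodeToString(sum[:])}
	}
	return res, nil
}

// Streams returns one stream per requested file; nothing is read until Open is called.
func Streams(_ context.Context, in map[string]Node, _ StreamParams) (map[string]Stream, error) {
	res := make(map[string]Stream, len(in))
	for path := range in {
		content, ok := repoFiles[path]
		if !ok {
			return nil, fmt.Errorf("file not found in repository: %s", path)
		}
		res[path] = Stream{
			Open:  func() (io.Reader, error) { return strings.NewReader(content), nil },
			Close: func() error { return nil },
		}
	}
	return res, nil
}

// copyFile runs Query and Streams for a single path and reads the content.
func copyFile(path string) (Node, string, error) {
	ctx := context.Background()
	nodes, err := Query(ctx, CompareRequest{}, nil)
	if err != nil {
		return Node{}, "", err
	}
	node, ok := nodes[path]
	if !ok {
		return Node{}, "", fmt.Errorf("unknown path: %s", path)
	}
	streams, err := Streams(ctx, map[string]Node{path: node}, StreamParams{})
	if err != nil {
		return Node{}, "", err
	}
	s := streams[path]
	r, err := s.Open()
	if err != nil {
		return Node{}, "", err
	}
	defer s.Close()
	data, err := io.ReadAll(r)
	return node, string(data), err
}

func main() {
	node, content, err := copyFile("data/a.txt")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s%s (%s %s): %q\n", node.Path, node.Name, node.HashType, node.Hash, content)
}
```

Keeping Query free of file content and deferring all reads to the streams means that a comparison never has to download files that turn out to be unchanged.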
The sequence diagrams for search and oauthtoken are very similar to this one, and are not shown separately.
sequenceDiagram
Frontend->>+Backend: /api/plugin/options
Backend->>Repo: Specific call, e.g., list branches
Repo-->>Backend: List of branches
Backend-->>-Frontend: List of options for the dropdown
sequenceDiagram
Frontend->>+Backend: /api/common/dvobjects
loop Until all pages are retrieved
Backend->>Dataverse: /api/v1/mydata/retrieve
Dataverse->>Backend: Dataverse collections
end
Backend-->>-Frontend: Dataverse collections
sequenceDiagram
Frontend->>+Backend: /api/common/newdataset
Backend->>Dataverse: POST /api/dataverses/{{Dataverse collection}}/datasets
Dataverse-->>Backend: Response
Backend-->>-Frontend: Persistent ID of the new dataset
sequenceDiagram
Frontend->>+Backend: /api/plugin/compare
Backend->>+Goroutine: Compare using Key as ref.
activate Goroutine
Backend-->>Frontend: Key
loop Until cached response ready
Frontend->>Backend: /api/common/cached
Backend->>Redis: Get(key)
Redis-->>Backend: Cached response if ready
Backend-->>Frontend: Cached response if ready
end
Goroutine->>Dataverse: List files
Dataverse-->>Goroutine: List of files
Goroutine->>Repo: List files
Repo-->>Goroutine: List of files
Goroutine->>Redis: Get known hashes
Redis-->>Goroutine: Known hashes
Goroutine->>Redis: Hashing job for unknown hashes
Goroutine->>Redis: Cached response is ready
deactivate Goroutine
loop Until all hashes known
Frontend->>Backend: /api/common/compare
Backend->>Redis: Get known hashes
Redis-->>Backend: Response
Backend-->>Frontend: Not all hashes known
end
Worker->>Redis: Get new job
Redis-->>Worker: Hashing job
activate Worker
loop Until all hashes known
Worker-->>Worker: Calculate N hashes
Worker->>Redis: Store calculated hashes
end
Worker->>Redis: All hashes known
deactivate Worker
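The worker loop at the end of the diagram above ("Calculate N hashes", "Store calculated hashes", repeated until all hashes are known) can be sketched as follows. This is a simplified illustration, not the worker's code: a plain map stands in for Redis, file contents are held in memory, MD5 is assumed as the hash type, and the hashBatch and hashAll names are hypothetical.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// hashBatch calculates at most n missing hashes, stores them, and returns
// the number of files whose hash is still unknown afterwards.
func hashBatch(store map[string]string, files map[string][]byte, n int) int {
	if n < 1 {
		n = 1 // always make progress
	}
	hashed, remaining := 0, 0
	for path, content := range files {
		if _, known := store[path]; known {
			continue
		}
		if hashed == n {
			remaining++
			continue
		}
		sum := md5.Sum(content)
		store[path] = hex.EncodeToString(sum[:])
		hashed++
	}
	return remaining
}

// hashAll repeats hashBatch until every hash is known and returns the
// number of rounds it needed. After each round the partial result is
// already visible in the store, so a poller can report progress.
func hashAll(store map[string]string, files map[string][]byte, n int) int {
	rounds := 0
	for {
		rounds++
		if hashBatch(store, files, n) == 0 {
			return rounds
		}
	}
}

func main() {
	files := map[string][]byte{
		"a.txt": []byte("hello"),
		"b.txt": []byte("world"),
		"c.txt": []byte("!"),
	}
	store := map[string]string{}
	rounds := hashAll(store, files, 2)
	fmt.Printf("hashed %d files in %d rounds\n", len(store), rounds)
}
```

Working in batches keeps each round short, so the calculated hashes are stored regularly and are not lost if a job is interrupted.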
sequenceDiagram
Frontend->>+Backend: /api/common/store
Backend->>Redis: Add new job
Backend->>Frontend: Job added
loop Until all files processed
Frontend->>Backend: /api/common/compare
Backend->>Redis: get processed files list
Redis-->>Backend: Response
Backend-->>Frontend: Not all files processed
end
Worker->>Redis: Get new job
Redis-->>Worker: Persisting job
activate Worker
loop Until all files processed
Worker-->>Worker: Process file (write or delete in dataset)
Worker-->>Redis: Notify file is processed
end
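The persisting job in the last diagram processes one file at a time (write or delete) and reports each processed file, so that the frontend can poll for progress. The following sketch shows that idea under stated assumptions: the Action and Progress types and the processJob function are hypothetical, plain maps stand in for the dataset and the source repository, and the notify callback stands in for the update written to Redis.

```go
package main

import (
	"fmt"
	"sort"
)

// Action is what the job does with one file (names assumed for this sketch).
type Action string

const (
	Write  Action = "write"
	Delete Action = "delete"
)

// Progress is a stand-in for the job state that a poller would read.
type Progress struct {
	Processed []string
	Total     int
}

// Done reports whether all files of the job have been processed.
func (p Progress) Done() bool { return len(p.Processed) == p.Total }

// processJob applies the actions to the dataset in sorted path order and
// calls notify after every processed file.
func processJob(dataset, source map[string]string, actions map[string]Action, notify func(Progress)) Progress {
	paths := make([]string, 0, len(actions))
	for p := range actions {
		paths = append(paths, p)
	}
	sort.Strings(paths)

	progress := Progress{Total: len(paths)}
	for _, p := range paths {
		switch actions[p] {
		case Write:
			dataset[p] = source[p] // add a new file or replace a changed one
		case Delete:
			delete(dataset, p) // file no longer exists in the source
		}
		progress.Processed = append(progress.Processed, p)
		notify(progress)
	}
	return progress
}

func main() {
	dataset := map[string]string{"a.txt": "v1", "old.txt": "x"}
	source := map[string]string{"a.txt": "v2", "b.txt": "new"}
	actions := map[string]Action{"a.txt": Write, "b.txt": Write, "old.txt": Delete}

	processJob(dataset, source, actions, func(p Progress) {
		fmt.Printf("%d/%d files processed\n", len(p.Processed), p.Total)
	})
	fmt.Println("dataset now:", dataset)
}
```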