
Unconditionally include API token for public files when forwarding to "explore" tools #7598

Open
mcgaerty opened this issue Feb 10, 2021 · 11 comments


@mcgaerty

Issue/Surprise:
When sending the user to an external tool of type "explore" for a file, the user's API token is omitted if the file is part of a fully public dataset. The token is only included when the file is either part of a draft or resides in a restricted dataset/dataverse (see the code snippet below). This happens even though the tool declares the API token as part of its invocation.

Expected Behavior:
If the external tool declaration contains the {apiToken} reserved word in its parameter specification, the tool should always receive a valid API token as part of the request.
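For context, a file-level explore tool declares the token roughly like this in its manifest (illustrative sketch: the tool name and URL are made up, while the {fileId}/{apiToken} reserved words and the toolParameters/queryParameters structure follow the external-tools section of the Dataverse guides):

```json
{
  "displayName": "Excerpt Tool",
  "description": "Example explore tool that requests the user's API token.",
  "scope": "file",
  "types": ["explore"],
  "toolUrl": "https://example.org/excerpt",
  "contentType": "application/json",
  "toolParameters": {
    "queryParameters": [
      {"fileid": "{fileId}"},
      {"key": "{apiToken}"}
    ]
  }
}
```

With a manifest like this, the surprise above is that the `key` query parameter is silently dropped for published, unrestricted files.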

Request Context:
In an ongoing project we are exploring the technical possibilities of providing partial access (aka excerpt generation) to protected (copyrighted) text material in accordance with the local legal framework. Since Dataverse is already used as the data repository at our university and the external tools API offers an excellent method of integrating additional/outsourced functionality, we chose it as the basis for our project. As part of the intended workflow, a public dataset (e.g. the unrestricted publication or "cleaned" research data) would include a manifest file (JSON, JSON-LD) that links to the protected associated resource from which an excerpt could be created (the tool has access to the restricted section via a separate API token of a read-only user). We would prefer not to duplicate user authentication/management, so the tool uses the user info API of the Dataverse installation calling it to track each user's quota on restricted files. This breaks as soon as the installation decides not to include the API token when invoking the external tool.

Associated Code:
The section responsible for the current behavior is located in the explore method of edu.harvard.iq.dataverse.FileDownloadServiceBean:

ApiToken apiToken = null;
User user = session.getUser();
DatasetVersion version = fmd.getDatasetVersion();
if (version.isDraft() || (fmd.getDataFile().isRestricted())) {
    if (user instanceof AuthenticatedUser) {
        AuthenticatedUser authenticatedUser = (AuthenticatedUser) user;
        apiToken = authService.findApiTokenByUser(authenticatedUser);
        if (apiToken == null) {
            // No unexpired token found; generate a new one
            apiToken = authService.generateApiTokenForUser(authenticatedUser);
        }
    } else if (user instanceof PrivateUrlUser) {
        PrivateUrlUser privateUrlUser = (PrivateUrlUser) user;
        PrivateUrl privateUrl = privateUrlService.getPrivateUrlFromDatasetId(privateUrlUser.getDatasetId());
        apiToken = new ApiToken();
        apiToken.setTokenString(privateUrl.getToken());
    }
}

I searched existing issues for a rationale behind this case differentiation but could not find anything specific. So my main question is: would removing the check and unconditionally including API tokens whenever the external tool specifies them as a parameter break existing contracts or violate security considerations?
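The change under discussion amounts to dropping the draft/restricted guard around the token lookup. A minimal self-contained sketch of that idea (hypothetical, simplified stand-ins for Dataverse's user and token types, not the real FileDownloadServiceBean code):

```java
// Illustrative sketch only: simplified types standing in for Dataverse's
// AuthenticatedUser/ApiToken classes, showing the guard removed so the token
// lookup runs whenever the tool declares {apiToken} in its manifest.
public class ExploreTokenSketch {

    record ApiToken(String tokenString) {}

    interface User {}
    record AuthenticatedUser(String id) implements User {}
    record GuestUser() implements User {}

    /** Token resolution without the isDraft()/isRestricted() guard. */
    static ApiToken resolveToken(User user, boolean toolWantsToken) {
        if (!toolWantsToken) {
            return null; // tool did not declare {apiToken}
        }
        if (user instanceof AuthenticatedUser au) {
            // In Dataverse this would be authService.findApiTokenByUser(...)
            // with a fallback to generateApiTokenForUser(...).
            return new ApiToken("token-for-" + au.id());
        }
        return null; // guests still get no token
    }

    public static void main(String[] args) {
        ApiToken t = resolveToken(new AuthenticatedUser("alice"), true);
        System.out.println(t != null); // true even for a fully public file
    }
}
```

The point of the sketch is that the decision depends only on the tool's declaration and the user type, no longer on the file's visibility.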

@djbrooke
Contributor

Hey @mcgaerty - sounds like an interesting use case. Is there a way we can learn more, like a call or a demo?

I'll let the more technical team and community members provide detailed responses, but I'm guessing we don't provide the API token when we don't need to.

The OpenDP team (opendp.io) is working in this space a bit. They want to create an account on their system after a user launches a tool from a Dataverse installation, for tracking purposes. We built an endpoint for them that can retrieve user information (#7345). We included this in a recent release. This was more geared towards cases where people were launching a configure tool (https://guides.dataverse.org/en/latest/api/external-tools.html) instead of an explore tool, but it may stir some thoughts here.

@qqmyers
Member

qqmyers commented Feb 10, 2021

I think the intent was to limit the exposure of the API key when not needed. There are open issues about providing shorter-term/more restricted alternatives to the API key going forward that should be able to support this use case. In the meantime, one suggestion would be to always send the API key to explore tools but not to previewers (now that those are distinct). That would keep previewers working as is while allowing this case in the near term.

@mcgaerty
Author

@djbrooke We have no publicly available demo ready yet, as the technical side of the project is still under experimental development. A minimalistic overview of the approach is available here (unfortunately only in German at the moment).

@qqmyers Here is what confused me when digging through the code after I realized that for some test files my external tool did not receive the API token parameter at all:

  • preview tools get the key unconditionally (edu.harvard.iq.dataverse.FilePage#preview)
  • configure tools get the key unconditionally (edu.harvard.iq.dataverse.ConfigureFragmentBean#getConfigurePopupToolHandler)
  • explore tools on the dataset level get the key unconditionally (edu.harvard.iq.dataverse.DatasetPage#explore)
  • only explore tools on the file level are restricted in the way described above (edu.harvard.iq.dataverse.FileDownloadServiceBean#explore)

So I figured there had to be some reason for that distinction.

We could switch to the dataset level for our tool, but that would incur certain usability penalties:

i) As far as I am aware, there are no filtering mechanisms available (yet) to restrict the availability of external tools for datasets based on properties of a dataset (e.g. "must contain a non-empty file named 'xsample_manifest.json'"). Users would therefore only find out whether their intended action was possible after being redirected to the external tool, which would fetch the dataset content, analyze it, and conclude that it cannot serve the request. The existing filter mechanisms for file-level tools are far friendlier in that regard, as they let us greatly limit the number of false positives users encounter when launching the tool from Dataverse.

ii) A single dataset can contain multiple files for which our excerpt-generation workflow is possible, requiring extra steps on the external tool's side for the user to select exactly which resource to use. If possible we would prefer to avoid that overhead, since the Dataverse UI already offers good and intuitive navigation inside a dataset.

@pdurbin
Member

pdurbin commented Feb 10, 2021

We've been following the principle of least privilege when it comes to passing around API tokens. If the token isn't needed, don't send it.

The problem with API tokens is that they allow the same permissions as the user itself, including deleting data. It would be especially bad for a superuser API token to fall into the wrong hands.

That said, we do pass them around readily (it's all we have!) and it's up to the admin of the installation to decide which external tools to authorize.

Some thoughts on where to go from here:

  • The easiest thing would be to add a new setting called something like :AlwaysSendApiToken that perhaps could be off by default.
  • Or we could have some kind of flag within each tool's JSON manifest indicating if it should always send API tokens.
  • Or we could invent something that helps you track users without needing the API token. It sounds like the tool itself has its own token for read-only access and the user's token is only used for tracking.
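The first option above could be sketched as follows (hypothetical: the :AlwaysSendApiToken setting name is only a suggestion from this thread, and the Map lookup is a simplified stand-in for Dataverse's SettingsServiceBean):

```java
// Sketch of gating the token on a new database setting, off by default.
import java.util.Map;

public class AlwaysSendSketch {
    /** Send the token if the setting is on, or under the current draft/restricted rules. */
    static boolean shouldSendToken(Map<String, String> settings,
                                   boolean isDraft, boolean isRestricted) {
        boolean always = Boolean.parseBoolean(
                settings.getOrDefault(":AlwaysSendApiToken", "false"));
        return always || isDraft || isRestricted;
    }

    public static void main(String[] args) {
        // Off by default: a public published file gets no token...
        System.out.println(shouldSendToken(Map.of(), false, false)); // false
        // ...but with the setting on, the token is always attached.
        System.out.println(shouldSendToken(
                Map.of(":AlwaysSendApiToken", "true"), false, false)); // true
    }
}
```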

As for inconsistencies in the code, it's probably the result of multiple developers working on it. 😄

You're right about dataset-level external tools. They always appear and there's no way to know the intention of the user who clicked them.

Anyway, neat project. Thanks for getting in touch!

@qqmyers
Member

qqmyers commented Feb 10, 2021

Ah - previewers were originally synonymous with explore tools and were launched on a separate page (where the URL used to launch them was visible). It looks like the move to only show previews in an iframe didn't preserve the check that avoids sending the key when it isn't needed. So explore tools end up being the odd case. (Configure tools are expected to write info back, so they have always been getting the key.)

Regardless, the main driver for the restriction was the previewers so with that now handled differently, I don't think it should be a concern to send the api key to explore tools (if they ask for it).

An alternative might be to add the ability to send the user id instead of the key. It should be easy to add that as one of the available parameters. Combined with an API key supplied out of band that allows access (as it sounds like you do now), that would probably cover the same use case: fewer keys being transferred, though the external tool has to store one that has broad access.
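The user-id alternative boils down to one more reserved word being substituted at launch time. A hypothetical sketch (the {userId} parameter name and the substitute helper are invented for illustration; Dataverse's real substitution logic lives in its external-tools handler):

```java
// Sketch of substituting a hypothetical {userId} reserved word into a tool's
// query-parameter template instead of (or alongside) {apiToken}.
public class UserIdParamSketch {
    static String substitute(String template, long userDatabaseId, String apiToken) {
        return template
                .replace("{userId}", Long.toString(userDatabaseId))
                .replace("{apiToken}", apiToken == null ? "" : apiToken);
    }

    public static void main(String[] args) {
        // Tool manifest asks only for the user id; no key leaves the server.
        System.out.println(substitute("uid={userId}", 42L, null)); // uid=42
    }
}
```

A tool that only needs to identify the user for quota tracking could then drop {apiToken} from its manifest entirely.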

@poikilotherm
Contributor

A few days ago I looked into MicroProfile JWT Auth. Maybe using JWT for external tools would be something to consider to avoid handing out API tokens? Just a crazy idea.

@mcgaerty
Author

@qqmyers Yes, in our workflow the tool is supplied a kind of "master key" for (read-only) access to a subverse hosting the protected data, which must not go fully public and which the users of the external tool themselves don't have access to anyway. And yes again, having the user id sent to the tool instead of the key for public files would fit our use case perfectly, since currently all we do with the key is call the user info API endpoint to fetch said user id for tracking. I imagine our project is not the only one with this kind of setup/requirement, and exposing the user id (or a comparably persistent token used only for identification) would pose a far smaller risk than the API key, which holds the full set of (write) rights the user possesses.

@pdurbin
Member

pdurbin commented Feb 11, 2021

or a comparably persistent token used only for identification)

That would be the database id of the user. It is more persistent than the username/identifier, which can change: https://guides.dataverse.org/en/5.3/api/native-api.html#change-identifier-label

@pdurbin
Member

pdurbin commented Oct 9, 2022

A minimalistic overview of the approach is available here

At #dataverse2022 @mcgaerty gave a talk about XSample (Excerpt Generation via External Tool):

Just checking... @mcgaerty are you still interested in this issue? Thanks.

@mcgaerty
Author

mcgaerty commented Oct 9, 2022

Thanks for checking @pdurbin , and yes we are still interested. The XSample project is currently shelved due to shifts in personnel, but we plan on continuing development on it.

PS: I gave that linked talk, but I know of a few instances where the two of us got mixed up, so that happens ;)

@pdurbin
Member

pdurbin commented Oct 11, 2022

@mcgaerty whoops! I edited my earlier comment to remove the other person. Sorry about that. 😅 Interesting talk!

You might be interested in this PR:

You'd be welcome to review it, of course. Not sure if it helps your use case or not.

@pdurbin added the "Type: Suggestion" label Oct 9, 2023