Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't directly proxy error statuses from arbitrary third parties #559

Open
seanh opened this issue Aug 24, 2021 · 0 comments
Open

Don't directly proxy error statuses from arbitrary third parties #559

seanh opened this issue Aug 24, 2021 · 0 comments
Assignees
Labels

Comments

@seanh
Copy link
Contributor

seanh commented Aug 24, 2021

Problem

Currently if a third-party server responds with a 4xx or 5xx error Via's NGINX component (that handles PDF proxying) directly proxies that response so Via itself responds with the same 4xx or 5xx. The problems with this are:

  • It can cause noise in our monitoring because users proxying broken third-party servers can cause Via to send 4xx or 5xx responses. If it's "normal" for Via to 4xx or 5xx then that can cover up actual issues
  • It can cause false alarms if users proxying broken third-party servers can trigger alarms that think Via is unhealthy or down because it's 4xx'ing or 5xx'ing, waking up on-call engineers. This can cause us to ignore or disable alarms and could cover up actual problems

History

Previously Via actually didn't proxy error responses from third-parties but this was implemented by completely swallowing the third-party status code and body. This led to difficulty in diagnosing a real issue in production when Via's proxying of Google Drive really was broken. As a result we removed the error hiding code but this leaves Via open to the monitoring noise and false alarms mentioned above.

This doesn't apply to Google Drive

Via now special-cases Google Drive URLs and routes them to a separate Python endpoint. This issue is just about all the other PDF URLs that get routed to the NGINX endpoint

Some file hosts might be special

We don't want arbitrary broken third-party servers to be able to cause Via to respond with arbitrary 4xx and 5xx codes and create monitoring and alerting noise. But we probably do want error responses from Google Drive (for example) to trigger errors from Via: we do want alarms to go off if proxying Google Drive is failing. This is because Google Drive is:

  1. A particularly important file host for our users, because the LMS app has a built-in Google Drive integration
  2. A file host that we trust to be reliable and well-behaved

Google Drive is already special-cased and routed to a separate Python endpoint so any changes to the NGINX endpoint won't affect Google Drive anyway. But the same reasoning might apply to other files hosts that the LMS app has pickers for: Canvas and Blackboard Files, Microsoft OneDrive. It might also apply to some hosts that we don't have pickers for if we know users use that host a lot, e.g. maybe Dropbox, or Box.com.

Canvas and Blackboard files may not be clear cut since not all Canvas and Blackboard instances are hosted by Instructure and Blackboard. Schools can host their own. So they may not always be as reliable.

Google Drive happens to be easy to special-case because all its URLs are drive.google.com but this may not be so easy for all hosts that we might want to special case, some of them might not be easy to identify.

The original status code and response still needs to be accessible somewhere

For debugging.

Solution

Some ideas:

  • Replace third-party error statuses with our own singular "third-party was broken" status code, so as not to create monitoring noise or trigger false alarms
  • Don't replace the third-party error bodies. Leave those intact for debugging
    • Note that there's no point in Via returning a nice error page here. Via's response goes to PDF.js's JavaScript code, it is not shown to the user. (For debugging you can find the response body in dev tools, which is why proxying through the third-party's response body can be useful for debugging.)
  • Copy the third-party's original status header into a custom header, for debugging
  • Exempt certain important domains (such as Microsoft OneDrive) that we do want to trigger alarms. We should continue to proxy error statuses for these domains

We might be able to use NGINX's ngx_http_sub_module to achieve some of the above or, failing that, the Lua module or the JavaScript module. The JavaScript module in particular seems squarely aimed at our use-case here.

@seanh seanh added the Backend label Aug 24, 2021
@seanh seanh self-assigned this Sep 13, 2021
@seanh seanh changed the title Don't completely swallow third-party PDF error responses Don't directly proxy error statuses from arbitrary third parties Sep 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant