Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/redshift iam auth #818

Merged
merged 8 commits into from
Jul 4, 2018
Merged

Conversation

drewbanin
Copy link
Contributor

@drewbanin drewbanin commented Jul 2, 2018

Reopening from #769 to integrate with new jsonschema validations.

Closes #757

@cmcarthur we're not currently running integration tests against Redshift... I think it's about time to change that! I'm going to see if I can add unit tests to at least validate the profiles.yml configs.

Usage

This PR adds IAM authentication for Redshift to dbt. A new method parameter is available for redshift connections. If the method parameter is not supplied, or if the value database is given, then simple user/pass authentication will be used. If the method parameter is set to iam, then dbt will use IAM Authentication to generate temporary credentials for the connection.

Example profile with both auth methods shown below:

default:
  target: iam
  outputs:
    iam:
      type: redshift
      method: iam               # NEW!
      cluster_id: garage        # NEW!
      iam_duration_seconds: 900 # NEW!
      host: redshift.host
      user: drew
      port: 5439
      dbname: analytics
      schema: dbt_dbanin
      threads: 1
    dev:
      type: redshift
      host: redshift.host
      user: drew
      pass: '...'
      port: 5439
      dbname: analytics
      schema: dbt_dbanin
      threads: 1

Properties:

method:
Can be one of iam, database, or unsupplied. Default=database

iam_duration_seconds:
The number of seconds after which the temporary credentials should expire. Default=900 (via boto)

cluster_id:
The name of the cluster (used to generate default credentials)

Discussion

The process of setting up IAM authentication is a little confusing. The documentation provided by Amazon is expectedly a little verbose, but I was able to get this set up with little prior information in about 15 minutes. Let's just be sure to link to the Amazon documentation when this is released.

Thanks to @danielchalef for the original code for this PR!

Copy link
Member

@cmcarthur cmcarthur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code looks great, but your circle build failed. just fix that and then :shipit:

did you set this up using a ~/.aws/credentials file?

'iam_duration_seconds': {
'type': ['null', 'integer'],
'minimum': 900,
'maximum': 3600
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean if a model takes >1hr to run, that connection will close? (that's fine, just want to understand)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's right. I couldn't find any specifics in the docs, but keen to set the timeout to 900s then experiment with a time.sleep() after acquiring the connection. I bet that once a query is issues, it will complete even if the creds are expired, but subsequent queries on the same connection will likely fail.

I wish we could make the timeout longer, but 3600s is the max per the docs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is pretty typical of aws federated auth. >1hr would be great.

@drewbanin
Copy link
Contributor Author

@cmcarthur yeah, this uses the ~/.aws/credentials file via boto3.

I (think) I fixed the unit tests, and I just added a bunch more in lieu of a working integration test against Redshift. Gave this a couple of spins against a Redshift cluster and it looks like the tmp credentials + validation are working really well!

@cmcarthur
Copy link
Member

cmcarthur commented Jul 3, 2018

pep8 failed

dbt/adapters/redshift/impl.py:21:80: E501 line too long (81 > 79 characters)

fix this and then :shipit:

@cmcarthur
Copy link
Member

cmcarthur commented Jul 3, 2018

@drewbanin where should we put documentation issues for this release? i just want to make sure we update the docs to say, if you use in-transaction post hooks and this auth method, it can cause problems because of the credential timeouts

EDIT: OH wait, what happens if you have a timeout of 60s, issue a BEGIN, run a query that takes 90s, and then try to COMMIT it?

@drewbanin
Copy link
Contributor Author

I'm a little confused about the timeout behavior here: I set up my IAM credentials to expire after 900 seconds, and then I configured a model to sleep for 960 seconds with some post hooks. The log output looks like:


2018-07-03 14:09:35,742: Using redshift connection "long_model".
2018-07-03 14:09:35,742: On long_model: create table
"dbt_dbanin"."long_model__dbt_tmp as (
select f_sleep(960)
);
2018-07-03 14:25:40,258: SQL status: SELECT in 964.52 seconds
2018-07-03 14:25:40,262: Using redshift connection "long_model".
2018-07-03 14:25:40,262: On long_model:
select 1 as id
2018-07-03 14:09:35,693: SQL status: SELECT in 0.02 seconds


So, really interestingly, the query succeeded even though it ran for longer than the specified timeout. Also, a subsequent query using the same connection also completed successfully.

Going to keep poking around to get a good feel for how this works

@drewbanin drewbanin self-assigned this Jul 3, 2018
@drewbanin
Copy link
Contributor Author

@cmcarthur I can very reliably reproduce queries that continue to work after they should have expired. I'm even checking the expiration time returned by boto3. My logs effectively look like:

PASSWORD EXPIRES AT: 2018-07-03T16:47:06.272000
....
2018-07-03 16:46:08,322: SQL status: SELECT in 120.21 seconds
2018-07-03 16:46:08,324: Using redshift connection "long_model".
2018-07-03 16:46:08,324: On long_model: 
      select f_sleep(120) as id

2018-07-03 16:48:08,550: SQL status: SELECT in 120.23 seconds
2018-07-03 16:48:08,553: Using redshift connection "long_model".
2018-07-03 16:48:08,553: On long_model: 
      select f_sleep(120) as id
    
2018-07-03 16:50:08,771: SQL status: SELECT in 120.22 seconds
2018-07-03 16:50:08,773: Using redshift connection "long_model".
2018-07-03 16:50:08,773: On long_model: 
      select f_sleep(120) as id
...

So, I think this must be a Redshift/AWS quirk/bug??? I dug through the boto3 source, but it's all just semantic wrappers around AWS APIs. I initially assumed that maybe they were throwing the timeout argument away and using some longer timeout instead, but I no longer believe that that's the case.

So, I think that we can probably just document the intended behavior, and point to the relevant AWS docs. We can include a caveat that, at the time of writing, iam_duration_seconds is not 100% reliable

@cmcarthur
Copy link
Member

whoa, that's unexpected! yeah, document the intended behavior and :shipit:.

@drewbanin drewbanin dismissed cmcarthur’s stale review July 3, 2018 21:52

connor gave me a :shipit: emoji

@drewbanin drewbanin merged commit 145a82b into development Jul 4, 2018
@drewbanin drewbanin deleted the feature/redshift-iam-auth branch July 4, 2018 00:50
@drewbanin drewbanin mentioned this pull request Jul 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants