Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect Typing Loaded by DBT #1616

Closed
jacobm001 opened this issue Jul 18, 2019 · 3 comments
Closed

Incorrect Typing Loaded by DBT #1616

jacobm001 opened this issue Jul 18, 2019 · 3 comments

Comments

@jacobm001
Copy link

jacobm001 commented Jul 18, 2019

Issue description

I'm trying to load a CSV as a seed within my DBT model, but DBT is loading one of my columns incorrectly.

Results

I have a CSV file that looks like this:

category,test_code,start_date,end_date
SAT,SAT3,2019-07-18,
SAT,SATM,2019-07-18,
SAT,SATV,2019-07-18,
SAT,S07,2019-07-18,
SAT,TSWE,2019-07-18,

When I run DBT seed, it loads the first column as a date which gives me an output of:

insert into mytable (category, test_code, start_date, end_date) values
  ('1-01-06','SAT3','2019-07-18',NULL)
  , ('1-01-06','SATM','2019-07-18',NULL)
  , ('1-01-06','SATV','2019-07-18',NULL)
  , ('1-01-06','S07','2019-07-18',NULL)
  , ('1-01-06','TSWE','2019-07-18',NULL)

Failed attempts to solve the issue:

  • quoting the columns
  • declaring their typing via the package file
  • Using the notation mentioned in this issue. That led to an entirely different error.

Other things of note:

  • Removing the problem column does not negatively effect the others.
  • If I do something stupid like add a 0 to the end of the first column, it will stay a string instead of be run through a datetime conversion.
  • Having the first column as lower case makes no difference.
  • Reordering the columns does not change the behavior
  • The problem really seems to be with the string 'SAT'. The string 'cat' is fine in its place, and if I add a bogus line of 'stop_the_bug' for the first column with blanks in the other columns, the problem goes away.

System information

The output of dbt --version:

installed version: 0.14.0
   latest version: 0.14.0

Up to date!

The operating system you're running on: Windows

The python version you're using: 3.6.5

The DB target: Snowflake

@drewbanin
Copy link
Contributor

Thanks for the report @jacobm001 - that's not ideal. See also #999

dbt offloads csv loading to a library called agate. It's frequently overzealous with it's type inference: There's no good reason why SAT should be interpreted as "the first saturday on the calendar AD", which is what I think is happening here. I've also seen the NA country code load as null (as in "not available"), and we just saw a report of dbt trying to load a very long number string as an integer type (eg. 32482394823498234) which failed at insertion time.

I think the right solution here is going to be to merge #1541. We can just force agate to treat every column as a string, then leave it to the user to cast the columns as needed in a model. This is a pretty blunt approach, but the alternative is playing "whack a mole" with all of these strange edge-cases.

Let me know what you think about all that!

@jacobm001
Copy link
Author

I think that makes a lot more sense. I never considered it was reading that as Saturday... seems so obvious now that you've pointed it out...

My organization deals with a lot of coded strings that have leading 0s, so I think defaulting everything to a string is likely going to be the solution that causes the majority of people the least amount of pain. I'd much rather have to manually declare types than be at the mercy of agate!

@drewbanin
Copy link
Contributor

dupe of #999, closing this one. see #1541

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants