write_dta support for long string (strL)? #437

Closed
deribo opened this issue Feb 28, 2019 · 10 comments
Labels
feature (a feature request or enhancement)

Comments

@deribo

deribo commented Feb 28, 2019

I need to convert a .json file containing log data into a .dta file. The file contains a handful of string variables with 5,000 or more characters.

Stata has supported long string variables (strL) since Stata 13, and haven supports long strings at least in write_sav(). Is there any way to write strL variables with haven? I haven't found anything on this. If not, is support planned for the future?

Bests

library(haven)

# Build n random strings, each 5000 uppercase letters long
longFun <- function(n) {
  do.call(paste0, replicate(5000, sample(LETTERS, n, TRUE), simplify = FALSE))
}

longString <- data.frame(V1 = longFun(1), stringsAsFactors = FALSE)

write_dta(longString, file.path(tempdir(), "longString.dta"))
#> Error in write_dta_(data, normalizePath(path, mustWork = FALSE), version = stata_file_format(version)): Writing failure: A provided string value was longer than the available storage size of the specified column.

@evanmiller
Collaborator

strLs are supported in the underlying ReadStat library (as READSTAT_TYPE_STRING_REF) - it will be up to the haven authors to implement support on the R side.

@hadley
Member

hadley commented Nov 6, 2019

@evanmiller what would support look like? We'd need to detect if a string was longer than some threshold, and if so, change variable types?

@hadley added the "feature" label (a feature request or enhancement) on Nov 6, 2019

@evanmiller
Collaborator

@hadley Probably something like that. The relevant threshold depends on the file version:

#define DTA_OLD_MAX_WIDTH    128
#define DTA_111_MAX_WIDTH    244
#define DTA_117_MAX_WIDTH   2045

Then use

readstat_string_ref_t *readstat_add_string_ref(readstat_writer_t *writer,
    const char *string);

readstat_error_t readstat_insert_string_ref(readstat_writer_t *writer,
    const readstat_variable_t *variable, readstat_string_ref_t *ref);
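
As a rough illustration of the threshold check discussed here, the following C sketch picks the fixed-width limit for a given .dta format version and decides whether a column could stay a fixed-width string or would need a strL. The limits are the three values quoted above; the version cutoffs (117 and 111), the `str_storage_t` enum, and both function names are assumptions made for this example and are not part of ReadStat's public API.

```c
#include <stddef.h>
#include <string.h>

/* Width limits as quoted in the comment above. */
#define DTA_OLD_MAX_WIDTH    128
#define DTA_111_MAX_WIDTH    244
#define DTA_117_MAX_WIDTH   2045

/* Hypothetical result type for this sketch. */
typedef enum { STR_FIXED, STR_STRL, STR_TOO_LONG } str_storage_t;

/* Assumed cutoffs: format 117 and later use the largest limit, 111 to 116
 * the middle one, anything older the smallest. Verify against ReadStat. */
static size_t dta_max_str_width(int format_version) {
    if (format_version >= 117) return DTA_117_MAX_WIDTH;
    if (format_version >= 111) return DTA_111_MAX_WIDTH;
    return DTA_OLD_MAX_WIDTH;
}

/* Scan a column of NUL-terminated strings (NULL entries are skipped).
 * If any value exceeds the fixed-width limit, the column needs a strL,
 * which this sketch assumes is only available from format 117 onward. */
static str_storage_t choose_storage(const char *const *values, size_t n,
                                    int format_version) {
    size_t limit = dta_max_str_width(format_version);
    for (size_t i = 0; i < n; i++) {
        if (values[i] != NULL && strlen(values[i]) > limit) {
            return format_version >= 117 ? STR_STRL : STR_TOO_LONG;
        }
    }
    return STR_FIXED;
}
```

Under these assumptions a 3000-character value exceeds the 2045-byte limit and would be routed to a strL, while a 2000-character value would stay a fixed-width string.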

If a string is repeated you can re-use the reference, so some kind of hash table of string refs might make sense (but it's not necessary).
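
The hash-table idea could look roughly like the sketch below: a small chained hash table keyed by string content, so that each distinct string is turned into a reference only once. Everything here (`ref_cache`, `ref_cache_get`, the `make_ref_fn` callback, `demo_make_ref`) is hypothetical scaffolding for illustration; in real use the callback would presumably wrap readstat_add_string_ref, with the writer passed as `ctx`.

```c
#include <stdlib.h>
#include <string.h>

#define REF_CACHE_BUCKETS 1024

/* Callback that creates a reference for a new string. Hypothetical: in real
 * use it would wrap readstat_add_string_ref(), with the writer as ctx. */
typedef void *(*make_ref_fn)(void *ctx, const char *string);

typedef struct ref_cache_entry {
    char *key;                     /* private copy of the string */
    void *ref;                     /* reference returned by make_ref */
    struct ref_cache_entry *next;  /* next entry in the same bucket */
} ref_cache_entry;

typedef struct {
    ref_cache_entry *buckets[REF_CACHE_BUCKETS];
    make_ref_fn make_ref;
    void *ctx;
} ref_cache;

/* djb2 string hash */
static unsigned long hash_string(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

static void ref_cache_init(ref_cache *c, make_ref_fn make_ref, void *ctx) {
    for (size_t i = 0; i < REF_CACHE_BUCKETS; i++) c->buckets[i] = NULL;
    c->make_ref = make_ref;
    c->ctx = ctx;
}

/* Return the cached reference for string, creating it on first sight.
 * Returns NULL on allocation failure or if make_ref fails. */
static void *ref_cache_get(ref_cache *c, const char *string) {
    size_t b = hash_string(string) % REF_CACHE_BUCKETS;
    for (ref_cache_entry *e = c->buckets[b]; e != NULL; e = e->next) {
        if (strcmp(e->key, string) == 0) return e->ref;
    }
    size_t len = strlen(string) + 1;
    ref_cache_entry *entry = malloc(sizeof *entry);
    char *key = malloc(len);
    void *ref = (entry && key) ? c->make_ref(c->ctx, string) : NULL;
    if (ref == NULL) { free(entry); free(key); return NULL; }
    memcpy(key, string, len);
    entry->key = key;
    entry->ref = ref;
    entry->next = c->buckets[b];
    c->buckets[b] = entry;
    return ref;
}

/* Free the cache's own bookkeeping; the references themselves are not freed. */
static void ref_cache_free(ref_cache *c) {
    for (size_t i = 0; i < REF_CACHE_BUCKETS; i++) {
        ref_cache_entry *e = c->buckets[i];
        while (e != NULL) {
            ref_cache_entry *next = e->next;
            free(e->key);
            free(e);
            e = next;
        }
        c->buckets[i] = NULL;
    }
}

/* Demo stand-in for a reference factory: counts calls through ctx and
 * hands out distinct dummy pointers (at most 16 in this sketch). */
static void *demo_make_ref(void *ctx, const char *string) {
    static int slots[16];
    int *calls = ctx;
    (void)string;
    if (*calls >= 16) return NULL;
    return &slots[(*calls)++];
}
```

Looking up the same content twice, even from two different buffers, returns the same reference and invokes the callback only once. As the comment above says, this kind of cache is optional: it only saves space when values actually repeat.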

@hadley
Member

hadley commented Nov 6, 2019

Makes sense; I'm unlikely to do this in the near term, but it makes sense to do in the future.

@hadley
Member

hadley commented Apr 9, 2021

Is there a reason to not use strL variables by default? (assuming version > 13)

@hadley
Member

hadley commented Apr 10, 2021

@evanmiller do I need to do more than this for it to work?

// when creating variable
readstat_add_variable(writer_, name, READSTAT_TYPE_STRING_REF, 0);

// when inserting values
readstat_insert_string_ref(writer_, var, readstat_add_string_ref(writer_, val));

It writes fine, but when I read the file back I get "Invalid file, or file has unsupported features."

@evanmiller
Collaborator

@hadley I'd need to see the whole program; don't forget the begin row / end row stuff and also check the return value of readstat_insert_string_ref.

@hadley
Member

hadley commented Apr 10, 2021

You can see my efforts so far in https://github.com/tidyverse/haven/pull/584/files (ignore the unordered_map stuff; I'm not using that yet). Here is a basic reprex showing that short strings work while long ones fail:

library(haven)

long_string <- function(n, m) {
  do.call("paste0", replicate(m, sample(LETTERS, n, TRUE), simplify = FALSE))
}

df1 <- data.frame(x = long_string(10, 3000), stringsAsFactors = FALSE)
df2 <- data.frame(x = long_string(10, 2000), stringsAsFactors = FALSE)

path <- tempfile()
write_dta(df1, path); read_dta(path)
#> Error: Failed to parse /private/tmp/RtmpbcBlqL/file22a13813e594: Invalid file, or file has unsupported features.
write_dta(df2, path); read_dta(path)
#> # A tibble: 10 x 1
#>    x                                                                            
#>    <chr>                                                                        
#>  1 EHKANXLOEWRBYWATCSFBTVNYPEGACKDXOINTAPZXEDQIFBWVPTRHRGFLLHPSZXJSYVQQJMEQEKVE…
#>  2 BAPXNRUZDAIBWWGZBRNTVYZAGVYPZKHENVLXELZHIRODCGMJWYTHICPQSQUPCECJEYUYZKJDUYCU…
#>  3 IYQKLASDVPJYJUCGVEFKRXYCKCIAPXYTUASZACLJAMJQDDSBJNWGWGAHIYAGSJYLSUWCMOXTXPOS…
#>  4 HQTJUVYZQILDACSAREXAHIJSPBHTUKQDRYEWUGCMWVCOCVKIZYCGUZAJOWZCXYWLQTZOLAGNKQRW…
#>  5 QZCREAKKWDQGQRJPZBNNHXHYQVYNIINLOBOVQPEVXERRAZOHRQJKNJPTSWOASIGTKJXQIWFJWKWJ…
#>  6 WAWYXUGPVHCYKVDBKGXSSEMASLOZRQKQWQCASOLONEOZBPPBZKPYMUECSSWFLFUFFYNDAOUOAGTN…
#>  7 EQOWRNDEBBSXZBSBREUNJWIOZFIERHFUWXGFRJOTHMYWIKXQWAXQQGTKOAZCNJTOQAAFKQXWLQDX…
#>  8 GHJDZYJXMSUXCOOBKGXMZNGOPRBQZNJROXRPZOCVQCHFEAHTVBLUSBNICZLMLDHYOHMSZKECWNRN…
#>  9 BDRJURSKGNAKWFFFAHAWPBSPVQSAQVPOWHYSREZVJIAYWUSZYRUSNLRPZNNWEVGTSTTXAEXZSRFY…
#> 10 ESCPCHPKHJWWWICVJUCAMKKHJPOUQFEOFDIKATKBYNGSXFRPDTERQEYOGNMSIJWBXULVXBMDFLOP…

Created on 2021-04-10 by the reprex package (v2.0.0)

@arthurgailes

> Is there a reason to not use strL variables by default? (assuming version > 13)

For short strings, strL can consume upwards of 10x as much memory. In addition, strL variables throw errors in some Stata commands, e.g. merge.

6 participants