write_dta support for long string (strL)? #437
strLs are supported in the underlying ReadStat library (as |
@evanmiller what would support look like? We'd need to detect if a string was longer than some threshold, and if so, change variable types? |
@hadley Probably something like that. The relevant threshold depends on the file version:
#define DTA_OLD_MAX_WIDTH 128
#define DTA_111_MAX_WIDTH 244
#define DTA_117_MAX_WIDTH 2045
Then use:
readstat_string_ref_t *readstat_add_string_ref(readstat_writer_t *writer,
        const char *string);
readstat_error_t readstat_insert_string_ref(readstat_writer_t *writer,
        const readstat_variable_t *variable, readstat_string_ref_t *ref);
If a string is repeated you can re-use the reference, so some kind of hash table of string refs might make sense (but it's not necessary). |
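The re-use idea in the comment above could be sketched as a small interning table. This is only an illustration, not ReadStat or haven code: `ref_table_t`, `intern_ref` and `ref_table_free` are hypothetical names, and a plain integer id stands in for the `readstat_string_ref_t *` that `readstat_add_string_ref` would return. The table is fixed-size with linear probing, so a real writer would need one that grows.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1024  /* fixed power of two; a real table would grow */

/* One slot per distinct string. `ref` is a stand-in (assumption) for the
   readstat_string_ref_t * a real writer would store here. */
typedef struct {
    char *key;  /* owned copy of the string, NULL while the slot is empty */
    int   ref;  /* id handed out the first time the string was seen */
} slot_t;

typedef struct {
    slot_t slots[TABLE_SIZE];
    int    next_ref;
} ref_table_t;

/* 32-bit FNV-1a hash of a NUL-terminated string. */
static uint32_t hash_str(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Return the ref already assigned to `s`, or assign a new one.
   Returns -1 if the table is full or memory runs out.
   The table must be zero-initialised (e.g. with calloc) before first use. */
static int intern_ref(ref_table_t *t, const char *s) {
    assert(t != NULL && s != NULL);
    uint32_t i = hash_str(s) & (TABLE_SIZE - 1);
    for (int probes = 0; probes < TABLE_SIZE; probes++) {
        slot_t *slot = &t->slots[i];
        if (slot->key == NULL) {
            /* First sighting: this is where a real writer would call
               readstat_add_string_ref() and keep the returned pointer. */
            size_t n = strlen(s) + 1;
            slot->key = malloc(n);
            if (slot->key == NULL) return -1;
            memcpy(slot->key, s, n);
            slot->ref = t->next_ref++;
            return slot->ref;
        }
        if (strcmp(slot->key, s) == 0) return slot->ref;  /* repeat: re-use */
        i = (i + 1) & (TABLE_SIZE - 1);                   /* linear probing */
    }
    return -1;
}

/* Release the owned key copies; the table itself belongs to the caller. */
static void ref_table_free(ref_table_t *t) {
    for (size_t i = 0; i < TABLE_SIZE; i++) free(t->slots[i].key);
}
```

With a table like this, each value would be looked up once per row and only distinct strings would reach `readstat_add_string_ref`. As the comment says, this is optional: calling `readstat_add_string_ref` for every value should also work, at the cost of storing repeats.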
Makes sense; I'm unlikely to do this in the near term, but it makes sense to do in the future. |
Is there a reason to not use |
@evanmiller do I need to do more than this for it to work?
// when creating variable
readstat_add_variable(writer_, name, READSTAT_TYPE_STRING_REF, 0);
// when inserting values
readstat_insert_string_ref(writer_, var, readstat_add_string_ref(writer_, val));
It writes fine, but when I read I get "Invalid file, or file has unsupported features." |
@hadley I'd need to see the whole program; don't forget the begin row / end row stuff and also check the return value of |
You can see my efforts so far in https://github.com/tidyverse/haven/pull/584/files (ignore the
library(haven)
long_string <- function(n, m) {
do.call("paste0", replicate(m, sample(LETTERS, n, TRUE), simplify = FALSE))
}
df1 <- data.frame(x = long_string(10, 3000), stringsAsFactors = FALSE)
df2 <- data.frame(x = long_string(10, 2000), stringsAsFactors = FALSE)
path <- tempfile()
write_dta(df1, path); read_dta(path)
#> Error: Failed to parse /private/tmp/RtmpbcBlqL/file22a13813e594: Invalid file, or file has unsupported features.
write_dta(df2, path); read_dta(path)
#> # A tibble: 10 x 1
#> x
#> <chr>
#> 1 EHKANXLOEWRBYWATCSFBTVNYPEGACKDXOINTAPZXEDQIFBWVPTRHRGFLLHPSZXJSYVQQJMEQEKVE…
#> 2 BAPXNRUZDAIBWWGZBRNTVYZAGVYPZKHENVLXELZHIRODCGMJWYTHICPQSQUPCECJEYUYZKJDUYCU…
#> 3 IYQKLASDVPJYJUCGVEFKRXYCKCIAPXYTUASZACLJAMJQDDSBJNWGWGAHIYAGSJYLSUWCMOXTXPOS…
#> 4 HQTJUVYZQILDACSAREXAHIJSPBHTUKQDRYEWUGCMWVCOCVKIZYCGUZAJOWZCXYWLQTZOLAGNKQRW…
#> 5 QZCREAKKWDQGQRJPZBNNHXHYQVYNIINLOBOVQPEVXERRAZOHRQJKNJPTSWOASIGTKJXQIWFJWKWJ…
#> 6 WAWYXUGPVHCYKVDBKGXSSEMASLOZRQKQWQCASOLONEOZBPPBZKPYMUECSSWFLFUFFYNDAOUOAGTN…
#> 7 EQOWRNDEBBSXZBSBREUNJWIOZFIERHFUWXGFRJOTHMYWIKXQWAXQQGTKOAZCNJTOQAAFKQXWLQDX…
#> 8 GHJDZYJXMSUXCOOBKGXMZNGOPRBQZNJROXRPZOCVQCHFEAHTVBLUSBNICZLMLDHYOHMSZKECWNRN…
#> 9 BDRJURSKGNAKWFFFAHAWPBSPVQSAQVPOWHYSREZVJIAYWUSZYRUSNLRPZNNWEVGTSTTXAEXZSRFY…
#> 10 ESCPCHPKHJWWWICVJUCAMKKHJPOUQFEOFDIKATKBYNGSXFRPDTERQEYOGNMSIJWBXULVXBMDFLOP…
Created on 2021-04-10 by the reprex package (v2.0.0) |
For short strings, strL can consume upwards of 10x as much memory. In addition, they throw errors in some Stata commands, e.g. |
Uses strL for strings longer than `strl_threshold` in `write_dta()`, closing #437 Co-authored-by: Danny Smith <danny@gorcha.org>
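A rough sketch of how a threshold like `strl_threshold` might drive the choice of storage type, using the width limits quoted earlier in the thread. This is not haven's actual implementation. The helper names are hypothetical, and the mapping from format version to limit is an assumption read off the macro names. The sketch also assumes strL exists only from format 117 (Stata 13) onward.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fixed-width string limits quoted earlier in the thread. */
#define DTA_OLD_MAX_WIDTH 128
#define DTA_111_MAX_WIDTH 244
#define DTA_117_MAX_WIDTH 2045

/* Assumption: each macro applies from the format version in its name. */
static size_t dta_max_str_width(int version) {
    if (version >= 117) return DTA_117_MAX_WIDTH;
    if (version >= 111) return DTA_111_MAX_WIDTH;
    return DTA_OLD_MAX_WIDTH;
}

/* Length of the longest string in a column; NULL entries count as empty. */
static size_t column_max_len(const char *const *values, size_t n) {
    assert(values != NULL || n == 0);
    size_t max = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = values[i] ? strlen(values[i]) : 0;
        if (len > max) max = len;
    }
    return max;
}

/* Hypothetical helper: returns 1 if the column should be written as strL.
   The user threshold is capped at the format's fixed-width limit, since
   longer values cannot be stored as an ordinary str# variable anyway. */
static int column_needs_strl(const char *const *values, size_t n,
                             int version, size_t strl_threshold) {
    if (version < 117) return 0;  /* assumed: no strL before format 117 */
    size_t cap = dta_max_str_width(version);
    size_t limit = strl_threshold < cap ? strl_threshold : cap;
    return column_max_len(values, n) > limit;
}
```

Deciding per column rather than per value keeps short-string columns in fixed-width storage, which matters given the memory overhead mentioned above.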
I need to convert a .json file containing log data into a .dta file. The file contains a handful of string variables longer than 5000 characters.
Stata has supported long string variables (strL) since Stata 13, and haven supports long strings at least in write_sav(), so I wanted to ask whether it is possible to write strL variables with haven (I haven't found anything on this). If not, is support for this planned in the future?
Best