---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# deduped
<!-- badges: start -->
[![](https://cranlogs.r-pkg.org/badges/deduped)](https://cran.r-project.org/package=deduped)
<!-- badges: end -->
`deduped` contains one main function, `deduped()`, which speeds up slow,
vectorized functions by performing computations only on the unique values
of the input and expanding the results at the end.
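Conceptually, for a vectorized function `f`, `deduped(f)(x)` behaves like the following dedupe-then-expand pattern (a simplified sketch using `toupper()` as a stand-in, not the package's actual implementation):
``` r
x <- c("b", "a", "b", "c", "b")  # input with many duplicates
ux <- unique(x)                  # compute only on the unique values
toupper(ux)[match(x, ux)]        # expand back: same result as toupper(x)
```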
One particular use case of `deduped()` that I come across a lot is applying `basename()` and `dirname()` to the `file_path` column after reading multiple CSVs (e.g. with `readr::read_csv(..., id = "file_path")`). `basename()` and `dirname()` are surprisingly slow (especially on Windows), and most of the column is duplicated.
## Installation
You can install the released version of `deduped` from
[CRAN](https://cran.r-project.org/package=deduped) with:
``` r
install.packages("deduped")
```
And the development version from [GitHub](https://github.com/orgadish/deduped):
``` r
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("orgadish/deduped")
```
## Examples
### Setup
```{r}
library(deduped)
set.seed(0)
# A deliberately slow, vectorized function: sleeps once per element,
# then returns the input unchanged so results can be compared.
slow_func <- function(ii) {
  for (i in ii) {
    Sys.sleep(0.0005)
  }
  ii
}
```
### `deduped()`
```{r example}
unique_vec <- sample(LETTERS, 5)
unique_vec
# Create a vector with significant duplication.
duplicated_vec <- sample(rep(unique_vec, 50))
length(duplicated_vec)
system.time({ x1 <- slow_func(duplicated_vec) })
system.time({ x2 <- deduped(slow_func)(duplicated_vec) })
all.equal(x1, x2)
```
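Because `deduped()` returns a wrapped function, the deduplicated version can also be stored and reused (a plain, non-evaluated snippet):
``` r
fast_func <- deduped(slow_func)
x3 <- fast_func(duplicated_vec)  # same result as slow_func(duplicated_vec)
```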
### `deduped(lapply)()`
`deduped()` can also be combined with `lapply()` or `purrr::map()`.
```{r example-map}
unique_list <- lapply(1:3, function(j) sample(LETTERS, j, replace = TRUE))
str(unique_list)
# Create a list with significant duplication.
duplicated_list <- sample(rep(unique_list, 50))
length(duplicated_list)
system.time({ y1 <- lapply(duplicated_list, slow_func) })
system.time({ y2 <- deduped(lapply)(duplicated_list, slow_func) })
all.equal(y1, y2)
```
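The `purrr::map()` version is analogous (a non-evaluated sketch, assuming `purrr` is installed):
``` r
y3 <- deduped(purrr::map)(duplicated_list, slow_func)
all.equal(y1, y3)
```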
### Specific example: `deduped(basename)()` on file paths
*Note: Times shown below are based on running R 4.3.2 on Windows 10, for which
`basename()` is known to be slow: [Bug 18597](https://bugs.r-project.org/show_bug.cgi?id=18597).*
```{r file_path_example}
# Create multiple CSVs to read
tf <- withr::local_tempdir()
# Duplicate mtcars 10,000x and write 1 CSV for each value of `am`
duplicated_mtcars <- dplyr::slice(mtcars, rep(1:nrow(mtcars), 10000))
invisible(sapply(
  dplyr::group_split(duplicated_mtcars, am),
  function(dat) {
    file_name <- paste0("mtcars_", unique(dat$am), ".csv")
    readr::write_csv(dat, file.path(tf, file_name))
  }
))
# Read the separate files back in.
mtcars_files <- list.files(tf, full.names = TRUE)
length(mtcars_files)
duplicated_mtcars_from_files <- readr::read_csv(
  mtcars_files,
  id = "file_path",
  show_col_types = FALSE
)
dplyr::count(duplicated_mtcars_from_files, basename(file_path))
# Original: slow
system.time({
  df1 <- dplyr::mutate(
    duplicated_mtcars_from_files,
    file_name = basename(file_path)
  )
})
# Deduped: fast
system.time({
  df2 <- dplyr::mutate(
    duplicated_mtcars_from_files,
    file_name = deduped(basename)(file_path)
  )
})
all.equal(df1, df2)
```
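The same pattern applies to `dirname()`, which is similarly slow on Windows (a non-evaluated sketch reusing the data frame read above):
``` r
dplyr::mutate(
  duplicated_mtcars_from_files,
  dir_name = deduped(dirname)(file_path)
)
```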