My data looks like that:
library(data.table)
fake_data <- c(
"A002JOHN 22M",
"A003JAMES25M",
"B234LEE 28M"
)
It's a text file where I have to separate the text in columns according to the number of characters. For example, the first 4 characters are one column, the next 5 characters are another column, etc.
Expected output:
V1 V2 V3 V4
1 A002 JOHN 22 M
2 A003 JAMES 25 M
3 B234 LEE 28 M
The method I use for now is the following:
method_1 <- function() {
tmp <- list()
for (i in seq_along(fake_data)) {
tmp[[i]] <- substring(fake_data[i], c(1, 5, 10, 12), c(4, 9, 11, 12))
}
do.call(rbind, tmp) %>%
as.data.frame()
}
I thought I could replace do.call
and as.data.frame
with the following:
method_2 <- function() {
tmp <- list()
for (i in seq_along(fake_data)) {
tmp[[i]] <- substring(fake_data[i], c(1, 5, 10, 12), c(4, 9, 11, 12))
}
setDT(transpose(tmp))
}
But it makes it slower:
microbenchmark::microbenchmark(
method_1(),
method_2(),
times = 1000
)
> Unit: microseconds
expr min lq mean median uq max neval
method_1() 50.4 61.4 83.8550 68.8 75.55 4348.9 1000
method_2() 94.2 108.9 141.1236 115.6 127.30 1211.8 1000
Any idea of how I could improve the speed for this operation?
CodePudding user response:
I don't think my solution is faster than yours in the post, but this might be another option maybe you would be interested
read.table(
text = gsub(
"(\\D\\d )(\\D )(\\d )(.*)",
"\\1 \\2 \\3 \\4",
fake_data
)
)
which gives
V1 V2 V3 V4
1 A002 JOHN 22 M
2 A003 JAMES 25 M
3 B234 LEE 28 M
Explanation
In gsub
, we use ()
to group the patterns we want to put into the data frame by columns.
(\\D\\d )
: non-digit character followed by digits(\\D )
: non-digit characters in a row(\\d )
: digit characters in a row(.*)
: anything left
CodePudding user response:
You can use tidyr::separate
:
library(tidyr)
fake_data <- data.frame(
V1=c("A002JOHN 22M",
"A003JAMES25M",
"B234LEE 28M"
))
fake_data |>
separate(V1, into=c("V1", "V2", "V3", "V4"), sep=c(4,9,11))
#> V1 V2 V3 V4
#> 1 A002 JOHN 22 M
#> 2 A003 JAMES 25 M
#> 3 B234 LEE 28 M
Created on 2021-12-14 by the reprex package (v2.0.1)
If your fields do not have fixed width, you can use regular expression within sep=
.