I have a dataset where each name value is repeated a certain number of times. I would like to reorder the rows so that, instead of having each name repeated, I have a repetition of the full sequence.
For example, if I have a dataset like this:
test_data <- data.frame(
  name = rep(c("a", "b", "c"), each = 3),
  value = 1:9
)
# the answers below refer to the same data as `test`
test <- test_data
> test_data
name value
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 c 7
8 c 8
9 c 9
then I want to have the rows reordered so that I have repeated sequences of a-b-c:
name value
1 a 1
2 b 4
3 c 7
4 a 2
5 b 5
6 c 8
7 a 3
8 b 6
9 c 9
Here's what I have so far:
# split by value in name
split_data <- split(test_data, ~ name)
# in each split dataset, create id2
list_data <- lapply(seq_along(split_data), function(x) {
  # group x gets ids x, x + k, x + 2k, ... where k is the number of groups
  new_id <- x + (seq_len(nrow(split_data[[x]])) - 1) * length(split_data)
  split_data[[x]]$id2 <- new_id
  return(split_data[[x]])
})
# bind the split data back, and order by this new id2
out <- do.call(rbind, list_data)
out <- out[order(out$id2), ]
out$id2 <- NULL
out
name value
1 a 1
4 b 4
7 c 7
2 a 2
5 b 5
8 c 8
3 a 3
6 b 6
9 c 9
This works, but I'm now looking for a more efficient way to do this both in terms of time and memory. Here's the performance of this code with a few million rows:
test_data <- data.frame(
name = rep(c("a", "b", "c"), each = 1000000),
value = 1:3000000
)
dim(test_data)
#> [1] 3000000 2
my_test <- function() {
  split_data <- split(test_data, ~ name)
  list_data <- lapply(seq_along(split_data), function(x) {
    new_id <- x + (seq_len(nrow(split_data[[x]])) - 1) * length(split_data)
    split_data[[x]]$id2 <- new_id
    return(split_data[[x]])
  })
  out <- do.call(rbind, list_data)
  out <- out[order(out$id2), ]
  out$id2 <- NULL
  out  # return the reordered data; the assignment above evaluates to NULL
}
bench::mark(
my_test()
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 my_test() 907ms 907ms 1.10 541MB 7.72
Is there a more efficient way to do the same thing? I'm looking for a solution in base R only.
Note: I put a very simple sequence of numbers for value. This is just for the example. The solution shouldn't depend on the values in value.
Edit to compare Mael's output:
test_data <- data.frame(
name = rep(c("a", "b", "c"), each = 10),
value = 1:30
)
mael <- function() {
out <- test_data[order(ave(test_data$name, test_data$name, FUN = seq_along)), ]
row.names(out) <- NULL
out
}
head(my_test())
#> name value
#> 1 a 1
#> 2 b 11
#> 3 c 21
#> 4 a 2
#> 5 b 12
#> 6 c 22
head(mael())
#> name value
#> 1 a 1
#> 2 b 11
#> 3 c 21
#> 4 a 10
#> 5 b 20
#> 6 c 30
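A guess at why the two outputs differ (a minimal sketch, not taken from Mael's answer): name is a character column, and ave() writes its result back into its first argument, so the per-group counter ends up as text and is sorted as text, where "10" comes before "2". The vector counter_chr below is made up for illustration.

```r
# Counters 1..10 stored as text, as they would be after ave() on a character column
counter_chr <- as.character(1:10)

# Sorting text puts "10" right after "1"
ord_chr <- order(counter_chr)
ord_chr
#> [1]  1 10  2  3  4  5  6  7  8  9

# Sorting the same counters as integers gives the intended order
ord_int <- order(as.integer(counter_chr))
ord_int
#> [1]  1  2  3  4  5  6  7  8  9 10
```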
CodePudding user response:
If the order of name is guaranteed (rows already grouped by name, with each name repeated the same number of times):
test_data[c(matrix(seq_len(nrow(test_data)),
                   nrow = length(unique(test_data$name)),
                   byrow = TRUE)), ]
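To see what the matrix trick does, here is a small sketch on the 9-row example from the question. It assumes the rows are already grouped by name and that every name occurs the same number of times; n_groups, m and idx are names introduced only for this illustration.

```r
test <- data.frame(
  name = rep(c("a", "b", "c"), each = 3),
  value = 1:9
)

# One matrix row per name, filled row by row:
#      [,1] [,2] [,3]
# [1,]    1    2    3    <- row indices of "a"
# [2,]    4    5    6    <- row indices of "b"
# [3,]    7    8    9    <- row indices of "c"
n_groups <- length(unique(test$name))
m <- matrix(seq_len(nrow(test)), nrow = n_groups, byrow = TRUE)

# c() reads the matrix column by column, which interleaves the groups
idx <- c(m)
idx
#> [1] 1 4 7 2 5 8 3 6 9

out <- test[idx, ]
```

No sorting is involved here, only the construction of an index vector, which is probably why this approach is cheap when its assumptions hold.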
CodePudding user response:
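With order, ave and seq_along:
# ave() writes its result back into its first argument, so pass a numeric vector;
# with the character column test$name the counters would be sorted as text
test[order(ave(as.numeric(factor(test$name)), test$name, FUN = seq_along)), ]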
# name value
# 1 a 1
# 4 b 4
# 7 c 7
# 2 a 2
# 5 b 5
# 8 c 8
# 3 a 3
# 6 b 6
# 9 c 9
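A small sketch of the same idea on groups of unequal size, in base R only. The data frame uneven and the variable pos are hypothetical names used for illustration; the counter stays an integer because the first argument to ave() is an integer vector.

```r
# Hypothetical input: "a" has 3 rows, "b" has 2, "c" has 1
uneven <- data.frame(
  name = c("a", "a", "a", "b", "b", "c"),
  value = 1:6
)

# Within-group counter: 1, 2, 3, ... restarting for each name
pos <- ave(seq_along(uneven$name), uneven$name, FUN = seq_along)
pos
#> [1] 1 2 3 1 2 1

# order() keeps tied elements in their original order, so within each
# round the names appear in the order they have in the data
out <- uneven[order(pos), ]
out$value
#> [1] 1 4 6 2 5 3
```

Once a group runs out of rows, the remaining rounds simply contain the names that are left.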
Also works with the wrapper rowid from data.table:
library(data.table)
test[order(rowid(test$name)), ]
CodePudding user response:
If you are willing to use external packages (rowid comes from data.table):
library(data.table)
test[order(rowid(test$name)), ]
# name value
# 1 a 1
# 4 b 4
# 7 c 7
# 2 a 2
# 5 b 5
# 8 c 8
# 3 a 3
# 6 b 6
# 9 c 9