Efficient way to reorder rows to have a repeated sequence?

Time:10-07

I have a dataset where each name value is repeated a certain number of times. I would like to reorder the rows, so that instead of having each name repeated, I have a repetition of the full sequence.

For example, if I have a dataset like this:

test <- data.frame(
  name = rep(c("a", "b", "c"), each = 3),
  value = 1:9
)

> test
  name value
1    a     1
2    a     2
3    a     3
4    b     4
5    b     5
6    b     6
7    c     7
8    c     8
9    c     9

then I want to have the rows reordered so that I have repeated sequences of a-b-c:

  name value
1    a     1
2    b     4
3    c     7
4    a     2
5    b     5
6    c     8
7    a     3
8    b     6
9    c     9

Here's what I have so far:

# split by value in name
split_data <- split(test, ~ name)

# in each split dataset, create id2 
list_data <- lapply(seq_along(split_data), function(x) {
  new_id <- x + (seq_len(nrow(split_data[[x]])) - 1) * length(split_data)
  split_data[[x]]$id2 <- new_id
  return(split_data[[x]])
})

# bind the split data back, and order by this new id2
out <- do.call(rbind, list_data)
out <- out[order(out$id2), ]
out$id2 <- NULL
out

  name value
1    a     1
4    b     4
7    c     7
2    a     2
5    b     5
8    c     8
3    a     3
6    b     6
9    c     9

This works, but I'm looking for a more efficient way to do it, in terms of both time and memory. Here's the performance of this code with a few million rows:

test_data <- data.frame(
  name = rep(c("a", "b", "c"), each = 1000000),
  value = 1:3000000
)
dim(test_data)
#> [1] 3000000       2

my_test <- function() {
  split_data <- split(test_data, ~ name)
  
  list_data <- lapply(seq_along(split_data), function(x) {
    new_id <- x + (seq_len(nrow(split_data[[x]])) - 1) * length(split_data)
    split_data[[x]]$id2 <- new_id
    return(split_data[[x]])
  })
  
  out <- do.call(rbind, list_data)
  out <- out[order(out$id2), ]
  out$id2 <- NULL
  out
}

bench::mark(
  my_test()
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 my_test()     907ms    907ms      1.10     541MB     7.72

Is there a more efficient way to do the same thing? I'm looking for a solution in base R only.

Note: I put a very simple sequence of numbers for value. This is just for the example. The solution shouldn't depend on the values in value.

Edit: comparing my output with Mael's answer, this time with 10 rows per name. The two results differ from the fourth row on:

test_data <- data.frame(
  name = rep(c("a", "b", "c"), each = 10),
  value = 1:30
)

mael <- function() {
  out <- test_data[order(ave(test_data$name, test_data$name, FUN = seq_along)), ]
  row.names(out) <- NULL
  out
}

head(my_test())
#>   name value
#> 1    a     1
#> 2    b    11
#> 3    c    21
#> 4    a     2
#> 5    b    12
#> 6    c    22
head(mael())
#>   name value
#> 1    a     1
#> 2    b    11
#> 3    c    21
#> 4    a    10
#> 5    b    20
#> 6    c    30
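
The mismatch appears to come from the type of the index rather than from order() itself: ave() returns a vector of the same type as its first argument, so passing the character column name yields character counters, and those sort as text. A minimal sketch (idx is just an illustrative name):

```r
name <- rep(c("a", "b", "c"), each = 10)

# ave() writes the group-wise results back into a copy of `name`,
# so the counters are coerced to character.
idx <- ave(name, name, FUN = seq_along)
class(idx)
#> [1] "character"

# Sorted as text, "10" comes before "2", which is why row 10 of each
# group lands directly after row 1.
order(c("1", "2", "10"))
#> [1] 1 3 2
```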

CodePudding user response:

If the rows are already grouped by name and every name occurs the same number of times:

test_data[c(matrix(seq_len(nrow(test_data)),
                   nrow = length(unique(test_data$name)),
                   byrow = TRUE)), ]
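
To see why this works, here is a minimal sketch on the 9-row example. It assumes the rows are grouped by name and that all groups have the same size; n_groups and idx are illustrative names, not part of the answer above.

```r
test_data <- data.frame(
  name = rep(c("a", "b", "c"), each = 3),
  value = 1:9
)

n_groups <- length(unique(test_data$name))

# Fill a matrix with the row numbers, one matrix row per name
# (byrow = TRUE), then read it back column by column with c().
idx <- c(matrix(seq_len(nrow(test_data)), nrow = n_groups, byrow = TRUE))
idx
#> [1] 1 4 7 2 5 8 3 6 9

out <- test_data[idx, ]
row.names(out) <- NULL  # optional: renumber the rows
```

No sorting is involved, only index arithmetic, which is why this approach is worth benchmarking against the order()-based ones.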

CodePudding user response:

With order + ave + seq_along:

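# fine while each name has fewer than 10 rows (the counter is character here)
test[order(ave(test$name, test$name, FUN = seq_along)), ]
# numeric input keeps the counter numeric, so larger groups sort correctly
test[order(ave(as.numeric(factor(test$name)), test$name, FUN = seq_along)), ]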

#   name value
# 1    a     1
# 4    b     4
# 7    c     7
# 2    a     2
# 5    b     5
# 8    c     8
# 3    a     3
# 6    b     6
# 9    c     9

Also works with the wrapper rowid from data.table:

library(data.table)
test[order(rowid(test$name)), ]
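
A slightly shorter base R variant is to hand ave() an integer vector directly instead of converting name through factor(). This is a sketch rather than a benchmarked solution, and pos is an illustrative name. It does not need the rows to be pre-sorted or the groups to be equal in size.

```r
test_data <- data.frame(
  name = rep(c("a", "b", "c"), each = 10),
  value = 1:30
)

# Within-group counter: 1, 2, ..., n for each name, kept as integers.
pos <- ave(seq_along(test_data$name), test_data$name, FUN = seq_along)

# order() leaves ties in their original order, so rows with the same
# counter stay in the order the names appear in the data.
out <- test_data[order(pos), ]
row.names(out) <- NULL
head(out)
#>   name value
#> 1    a     1
#> 2    b    11
#> 3    c    21
#> 4    a     2
#> 5    b    12
#> 6    c    22
```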

CodePudding user response:

If you are willing to use external packages:

library(data.table)
test[order(rowid(test$name)), ]
#   name value
# 1    a     1
# 4    b     4
# 7    c     7
# 2    a     2
# 5    b     5
# 8    c     8
# 3    a     3
# 6    b     6
# 9    c     9