Add empty rows at specific positions of dataframe

Time:11-24

I want to add empty rows at specific positions of a dataframe. Let's say we have this dataframe:

df <- data.frame(var1 = c(1,2,3,4,5,6,7,8,9),
     var2 = c(9,8,7,6,5,4,3,2,1))

I want to add an empty row after rows 1, 3 and 5. (I know that this is not best practice in most cases; ultimately I want to create a table using flextable.) These row numbers are saved in a vector:

rows <- c(1,3,5)

Now I want to use a for loop that loops through the rows vector to add an empty row after each row using add_row():

for (i in rows) {
  df <- tibble::add_row(df, .after = i)
}

The problem is that while the first iteration works flawlessly, the later empty rows get misplaced because the dataframe grows with each insertion. To fix this, I tried adding 1 to the vector after each iteration:

for (i in rows) {
  df <- tibble::add_row(df, .after = i)
  rows <- rows + 1
}

Which does not work. I assume the rows vector is only evaluated once. Anyone got any ideas?

CodePudding user response:

Do it all at once, no need for looping. Build the sequence of row numbers, append the positions after which a blank row should go, sort, and then set the rows at the duplicated indices to NA:

s <- sort(c(seq_len(nrow(df)), rows))
out <- df[s,]
out[duplicated(s),] <- NA

#    var1 var2
#1      1    9
#1.1   NA   NA
#2      2    8
#3      3    7
#3.1   NA   NA
#4      4    6
#5      5    5
#5.1   NA   NA
#6      6    4
#7      7    3
#8      8    2
#9      9    1
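
If this is needed more than once, the three lines above could be wrapped in a small helper. The following is only a sketch: the function name insert_blank_rows and its argument names are made up for illustration, and it assumes `after` holds valid row numbers of `data`. It also resets the row names so the result is numbered 1 to n instead of showing names such as 1.1.

```r
# Hypothetical helper wrapping the index-and-duplicate approach (base R only).
insert_blank_rows <- function(data, after) {
  # Every original row index, plus one extra copy of each index in `after`.
  idx <- sort(c(seq_len(nrow(data)), after))
  out <- data[idx, , drop = FALSE]
  # The second occurrence of an index marks the row to blank out.
  out[duplicated(idx), ] <- NA
  # Replace row names such as "1.1" with a plain 1..n sequence.
  rownames(out) <- NULL
  out
}

df <- data.frame(var1 = 1:9, var2 = 9:1)
res <- insert_blank_rows(df, after = c(1, 3, 5))
res
```

With the example data, the blank rows should end up at positions 2, 5 and 8 of a 12-row result.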

This will be much more efficient than looping, even for moderately sized data:

df <- df[rep(1:9,1e4),]
rows <- seq(1,9e4,100)

system.time({
for (i in rev(rows)) {
  df <- tibble::add_row(df, .after = i)
}
})
#   user  system elapsed 
#  23.94    0.05   23.98 

df <- df[rep(1:9,1e4),]
rows <- seq(1,9e4,100)

system.time({
s <- sort(c(seq_len(nrow(df)), rows))
out <- df[s,]
out[duplicated(s),] <- NA
})
#   user  system elapsed 
#   0.00    0.01    0.02 

CodePudding user response:

You could achieve your result by looping in the reverse direction:

df <- data.frame(
  var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  var2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1)
)
rows <- c(1, 3, 5)

for (i in rev(rows)) {
  df <- tibble::add_row(df, .after = i)
}
df
#>    var1 var2
#> 1     1    9
#> 2    NA   NA
#> 3     2    8
#> 4     3    7
#> 5    NA   NA
#> 6     4    6
#> 7     5    5
#> 8    NA   NA
#> 9     6    4
#> 10    7    3
#> 11    8    2
#> 12    9    1

Or, via Reduce, you could do:

Reduce(function(x, y) tibble::add_row(x, .after = y), rev(rows), init = df)

which is more efficient than the for-loop but will still not beat the approach suggested by @thelatemail:

microbenchmark::microbenchmark(
  for (i in rev(rows)) {
    df <- tibble::add_row(df, .after = i)
  },
  Reduce(function(x, y) tibble::add_row(x, .after = y), rev(rows), init = df),
  {
    s <- sort(c(seq_len(nrow(df)), rows))
    out <- df[s, ]
    out[duplicated(s), ] <- NA
    out
  }
)
#> Unit: microseconds
#>                                                                                                     expr
#>                                       for (i in rev(rows)) {     df <- tibble::add_row(df, .after = i) }
#>                         Reduce(function(x, y) tibble::add_row(x, .after = y), rev(rows),      init = df)
#>  {     s <- sort(c(seq_len(nrow(df)), rows))     out <- df[s, ]     out[duplicated(s), ] <- NA     out }
#>       min        lq      mean    median       uq      max neval
#>  1508.518 1546.4630 1704.5568 1577.0925 1889.981 3595.947   100
#>   237.268  256.1945  319.0439  271.1885  296.632 2257.603   100
#>   126.114  161.6135  204.5679  169.1120  184.000 2035.941   100
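
As for why the attempt in the question fails: for() evaluates its sequence once, before the first iteration, so reassigning rows inside the loop does not change the values i takes. A forward loop can still work if the shifted positions are computed up front. The sketch below assumes rows is sorted in ascending order and that the tibble package is installed; the names shifted and out are only illustrative.

```r
library(tibble)  # provides add_row()

df <- data.frame(var1 = 1:9, var2 = 9:1)
rows <- c(1, 3, 5)

# The k-th insertion happens when k - 1 blank rows already sit above it,
# so each target position moves down by k - 1.
shifted <- rows + seq_along(rows) - 1  # 1, 4, 7

out <- df
for (i in shifted) {
  # With no column values supplied, add_row() fills the new row with NA.
  out <- add_row(out, .after = i)
}
out
```

This gives the same layout as the reverse loop, with blank rows at positions 2, 5 and 8, but it is still a row-by-row loop and so is not expected to compete with the vectorised indexing approach.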