I want to add empty rows at specific positions of a dataframe. Let's say we have this dataframe:
df <- data.frame(var1 = c(1,2,3,4,5,6,7,8,9),
var2 = c(9,8,7,6,5,4,3,2,1))
In this dataframe I want to add an empty row after rows 1, 3 and 5. (I know that this is not best practice in most cases; ultimately I want to create a table using flextable.) These row numbers are saved in a vector:
rows <- c(1,3,5)
Now I want to use a for loop that loops through the rows vector to add an empty row after each row using add_row():
for (i in rows) {
df <- add_row(df, .after = i)
}
The problem is that while the first iteration works flawlessly, the other empty rows get misplaced, because the dataframe gets one row longer with each insertion. To fix this I tried adding 1 to the vector after each iteration:
for (i in rows) {
df <- add_row(df, .after = i)
rows <- rows + 1
}
This does not work either. I assume the rows vector only gets evaluated once. Does anyone have any ideas?
CodePudding user response:
Do it all at once; there is no need for looping. Make a sequence of row numbers, add the row numbers from rows, sort, subset the dataframe with the result, then replace the duplicated rows with NA:
s <- sort(c(seq_len(nrow(df)), rows))
out <- df[s,]
out[duplicated(s),] <- NA
# var1 var2
#1 1 9
#1.1 NA NA
#2 2 8
#3 3 7
#3.1 NA NA
#4 4 6
#5 5 5
#5.1 NA NA
#6 6 4
#7 7 3
#8 8 2
#9 9 1
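The steps above can be wrapped in a small helper. This is only a sketch: the name insert_empty_rows and its argument after are made up for illustration, and it assumes that after holds unique row numbers between 1 and nrow(df). It also resets the row names, which the snippet above does not do.

```r
# Hypothetical helper (not from any package): insert one all-NA row
# after each row number in `after`.
insert_empty_rows <- function(df, after) {
  # Every original row index once, plus a second copy of each index in `after`.
  s <- sort(c(seq_len(nrow(df)), after))
  out <- df[s, , drop = FALSE]
  # duplicated() flags the second copy of each repeated index; blank those rows.
  out[duplicated(s), ] <- NA
  # Replace row names such as "1.1" with a plain 1..n sequence.
  rownames(out) <- NULL
  out
}

df <- data.frame(var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 var2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1))
res <- insert_empty_rows(df, c(1, 3, 5))
empty_at <- which(is.na(res$var1))  # positions of the inserted rows
```

Because the sorted index vector places each extra copy directly after the original, the blank row always lands right after the row it belongs to, no matter how many rows are inserted.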
This will be much more efficient than looping, even for moderately sized data:
df <- df[rep(1:9,1e4),]
rows <- seq(1,9e4,100)
system.time({
for (i in rev(rows)) {
df <- tibble::add_row(df, .after = i)
}
})
# user system elapsed
# 23.94 0.05 23.98
df <- df[rep(1:9,1e4),]
rows <- seq(1,9e4,100)
system.time({
s <- sort(c(seq_len(nrow(df)), rows))
out <- df[s,]
out[duplicated(s),] <- NA
})
# user system elapsed
# 0.00 0.01 0.02
CodePudding user response:
You could achieve your result by looping in the reverse direction:
df <- data.frame(
var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
var2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1)
)
rows <- c(1, 3, 5)
for (i in rev(rows)) {
df <- tibble::add_row(df, .after = i)
}
df
#> var1 var2
#> 1 1 9
#> 2 NA NA
#> 3 2 8
#> 4 3 7
#> 5 NA NA
#> 6 4 6
#> 7 5 5
#> 8 NA NA
#> 9 6 4
#> 10 7 3
#> 11 8 2
#> 12 9 1
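Looping in reverse works because inserting after a later row does not move any of the earlier rows. It may also help to see why the attempt in the question fails: for() evaluates its vector once, before the first iteration, so changing rows inside the loop body has no effect on the values i takes. A forward loop works too if the positions are shifted up front. The sketch below uses base R only; add_empty_after is a hypothetical stand-in for tibble::add_row(df, .after = i), written here so the example runs without extra packages.

```r
# Hypothetical base-R stand-in for tibble::add_row(df, .after = i):
# inserts one all-NA row after row i.
add_empty_after <- function(df, i) {
  empty <- df[NA_integer_, , drop = FALSE]  # one row of NAs, same columns
  out <- rbind(df[seq_len(i), , drop = FALSE],
               empty,
               df[seq_len(nrow(df) - i) + i, , drop = FALSE])
  rownames(out) <- NULL
  out
}

df <- data.frame(var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 var2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1))
rows <- c(1, 3, 5)

# for() fixes its sequence before the first iteration, so modifying
# `rows` in the body does not change what `i` iterates over.
seen <- c()
for (i in rows) {
  seen <- c(seen, i)
  rows <- rows + 1
}
# `seen` is still 1 3 5, although `rows` itself has changed.

# Forward loop: the k-th insertion happens after k - 1 rows were already
# added, so shift the k-th position by k - 1 before looping.
rows <- c(1, 3, 5)
shifted <- sort(rows) + seq_along(rows) - 1
out <- df
for (i in shifted) {
  out <- add_empty_after(out, i)
}
empty_at <- which(is.na(out$var1))  # positions of the inserted rows
```

Either way, each iteration still copies the whole dataframe, so the loop-based variants scale worse than the vectorised indexing approach.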
Or via Reduce you could do
Reduce(function(x, y) tibble::add_row(x, .after = y), rev(rows), init = df)
which is more efficient than the for loop but will still not beat the approach suggested by @thelatemail:
microbenchmark::microbenchmark(
for (i in rev(rows)) {
df <- tibble::add_row(df, .after = i)
},
Reduce(function(x, y) tibble::add_row(x, .after = y), rev(rows), init = df),
{
s <- sort(c(seq_len(nrow(df)), rows))
out <- df[s, ]
out[duplicated(s), ] <- NA
out
}
)
#> Unit: microseconds
#> expr
#> for (i in rev(rows)) { df <- tibble::add_row(df, .after = i) }
#> Reduce(function(x, y) tibble::add_row(x, .after = y), rev(rows), init = df)
#> { s <- sort(c(seq_len(nrow(df)), rows)) out <- df[s, ] out[duplicated(s), ] <- NA out }
#> min lq mean median uq max neval
#> 1508.518 1546.4630 1704.5568 1577.0925 1889.981 3595.947 100
#> 237.268 256.1945 319.0439 271.1885 296.632 2257.603 100
#> 126.114 161.6135 204.5679 169.1120 184.000 2035.941 100