Home > Software design >  De-aggregate a data frame
De-aggregate a data frame

Time:02-11

There have been many similar questions (e.g. Repeat each row of data.frame the number of times specified in a column, De-aggregate / reverse-summarise / expand a dataset in R, Repeating rows of data.frame in dplyr), but my data set is of a different structure than the answers to these questions assume.

I have a data frame with the frequencies of measurements within each group and the total number of observations for each outcome per group total_N:

tibble(group=c("A", "B"), total_N=c(4,5), measure_A=c(1,4), measure_B=c(2,3))
# A tibble: 2 x 4
  group total_N outcome_A outcome_B
  <chr>   <dbl>     <dbl>     <dbl>
1 A           4         1         2
2 B           5         4         3

I want to de-aggregate the data, so that the data frame has as many rows as total observations and each outcome has a 1 for all observations with the outcome and a 0 for all observations without the outcome. Thus the final result should be a data frame like this:

# A tibble: 9 x 3
  group outcome_A outcome_B
  <chr>     <dbl>     <dbl>
1 A             1         1
2 A             0         1
3 A             0         0
4 A             0         0
5 B             1         1
6 B             1         1
7 B             1         1
8 B             1         0
9 B             0         0

As the aggregated data does not contain any information about the frequency of combinations (i.e., the correlation) of outcome_A and outcome_B, this can be ignored.

CodePudding user response:

Here's a tidyverse solution.

As you say, it's easy to repeat a row an arbitrary number of times. If you know that row_number() counts rows within groups when a data frame is grouped, then it's easy to convert grouped counts to presence/absence flags. across gives you a way to succinctly convert multiple count columns.

library(tidyverse)

tibble(group=c("A", "B"), total_N=c(4,5), measure_A=c(1,4), measure_B=c(2,3)) %>% 
  uncount(total_N) %>% 
  group_by(group) %>% 
  mutate(
    across(
      starts_with("measure"), 
      function(x) as.numeric(row_number() <= x)
    )
  ) %>%
  ungroup()
# A tibble: 9 × 3
  group measure_A measure_B
  <chr>     <dbl>     <dbl>
1 A             1         1
2 A             0         1
3 A             0         0
4 A             0         0
5 B             1         1
6 B             1         1
7 B             1         1
8 B             1         0
9 B             0         0

As you say, this approach takes no account of correlations between the outcome columns, as this cannot be deduced from the grouped data.

  • Related