Home > Enterprise >  How to repeat rows by their value by multiple columns and divide back
How to repeat rows by their value by multiple columns and divide back

Time:10-13

Let's say I have this dataframe:

> df <- data.frame(A=1:5, B=c(0, 0, 3, 0, 0), C=c(1, 0, 0, 1, 0), D=c(0, 2, 0, 0, 1))
> df
  A B C D
1 1 0 1 0
2 2 0 0 2
3 3 3 0 0
4 4 0 1 0
5 5 0 0 1

How would I go about converting it to:

  A B C D
1 1 0 1 0
2 2 0 0 1
3 2 0 0 1
4 3 1 0 0
5 3 1 0 0
6 3 1 0 0
7 4 0 1 0
8 5 0 0 1

As you can see there are value 2 and 3, I want to repeat them by that length and change the values back to 1. How would I do that?

I also want to duplicate the the A column as you can see.

I tried:

replace(df[rep(rownames(df), select(df, -A)),], 2, 1)

But it gives me an error.

CodePudding user response:

One option would be to get max value from columns B, C and D using pmax, use uncount to repeat the rows. Use pmin to replace the values greater than 1 to 1.

library(dplyr)
library(tidyr)

df %>%
  mutate(repeat_row = pmax(B, C, D)) %>%
  uncount(repeat_row) %>%
  mutate(across(-A, pmin, 1))

#  A B C D
#1 1 0 1 0
#2 2 0 0 1
#3 2 0 0 1
#4 3 1 0 0
#5 3 1 0 0
#6 3 1 0 0
#7 4 0 1 0
#8 5 0 0 1

CodePudding user response:

Apparently, there's just one value > 0 in columns B to D, so we can exploit the partial rowSums for a replicate call on columns B to D binarized using > 0. So that we can use this in Map, we transpose twice. Rest is cosmetics.

t(do.call(cbind, Map(replicate,
                     rowSums(df[-1]), 
                     as.data.frame(t(cbind(df[1], df[-1] > 0)))))) |>
  as.data.frame() |>
  setNames(names(df))
#   A B C D
# 1 1 0 1 0
# 2 2 0 0 1
# 3 2 0 0 1
# 4 3 1 0 0
# 5 3 1 0 0
# 6 3 1 0 0
# 7 4 0 1 0
# 8 5 0 0 1

Note: R>=4.1 used.

CodePudding user response:

Just to modify Ronak Shah's answer a bit, I realized you could simply just do it with only dplyr:

library(dplyr)

df[rep(rownames(df), apply(select(df, -A), 1, max)),] %>%
  as.data.frame(row.names=1:nrow(.)) %>%
  mutate(across(-A, pmin, 1))

Output:

  A B C D
1 1 0 1 0
2 2 0 0 1
3 2 0 0 1
4 3 1 0 0
5 3 1 0 0
6 3 1 0 0
7 4 0 1 0
8 5 0 0 1

Or with rowSums:

library(dplyr)

df[rep(rownames(df), rowSums(select(df, -A)),] %>%
  as.data.frame(row.names=1:nrow(.)) %>%
  mutate(across(-A, pmin, 1))
  • Related