Home > Net >  How to reshape using id variable as final value, when id has duplicates
How to reshape using id variable as final value, when id has duplicates

Time:10-20

I've struggled a lot to get this right in R. My data looks like this (real data has more than 120k observations):

df <- data.frame(
  Input = c(1,2,3,4,4,5,1,3,4),
  Output = c(91,91,91,91,91,91,92,92,92)
)

df
  Input Output
1     1     91
2     2     91
3     3     91
4     4     91
5     4     91
6     5     91
7     1     92
8     3     92
9     4     92

The numbers are codes for physical stuff. Key here is that product 91 uses 5 inputs whereas product 92 uses only 3, AND there are duplicates in 91 (two 4s)

I want the data frame to have outputs as column names and inputs as values (disregarding duplicates):

       91         92
       1          1
       2          NA
       3          3
       4          4
       5          NA

So, I did a reshape from long to wide using

df2 <- reshape(df, idvar = "Input", v.names = "Output", timevar = "Output", direction = "wide", sep = "_")

This takes lots of time and haven't been able to make other code from this post to work. It yields an intermediate step:

  Input Output_91 Output_92
1     1        91        92
2     2        91        NA
3     3        91        92
4     4        91        92
5     5        91        NA

Then, all I have to do is to replace each Output column with the first column, except if NA. I can do this trivially for one column at a time. For instance:

df2$Output_92[!is.na(df2$Output_92)] <- df2$Input[!is.na(df2$Output_92)]

I've been trying to make a loop over columns to iterate over all 2 columns, mimicking the above command. Something like:

for(i in colnames(df2)){
        df2[!is.na(i)][i] <- df2$Input[!is.na(i)][i]
      }

This does not work.

So, in principle I need help either only in this loop (title of question). But hints optimising the reshaping, or perhaps a simpler way to do the whole thing are more than welcome.

UPDATE: I realised duplicates was key to my problem, since without these, standard solutions in the linked post work fine. Updated question accordingly. The answer below helps with cases where ID has duplicates.

CodePudding user response:

If you have duplicate inputs for a given output, then assuming you don't care to count them or treat them any differently, then all methods below work by replacing df with unique(df). (One can also use dplyr::distinct if preferred.)

Once you've resolved uniqueness, then ... this is "just" reshaping/pivoting from long to wide.

base R

stats::reshape is a little hard to work with, and it requires some things that are not uniquely present. For example, it requires idvar and the v.names variables to be unique columns (so we duplicate Input):

df$Input2 <- df$Input
out <- reshape(unique(df), idvar = "Input", v.names = "Input2", timevar = "Output", direction = "wide")
out
#   Input Input2.91 Input2.92
# 1     1         1         1
# 2     2         2        NA
# 3     3         3         3
# 4     4         4         4
# 6     5         5        NA
names(out) <- gsub("Input2\\.", "", names(out))
# out[,-1]
  91 92
# 1  1  1
# 2  2 NA
# 3  3  3
# 4  4  4
# 6  5 NA

tidyr

(Ditto on the duplicated column.)

library(dplyr)
library(tidyr) # pivot_wider
df %>%
  distinct() %>%
   mutate(Input2 = Input) %>%
   pivot_wider(Input, names_from = "Output", values_from = "Input2") %>%
   select(-Input)
# # A tibble: 5 x 2
#    `91`  `92`
#   <dbl> <dbl>
# 1     1     1
# 2     2    NA
# 3     3     3
# 4     4     4
# 5     5    NA

data.table

No such requirement on needing "Input2" here.

library(data.table)
dcast(unique(as.data.table(df)), Input ~ Output, value.var = "Input")[,-1]
#       91    92
#    <num> <num>
# 1:     1     1
# 2:     2    NA
# 3:     3     3
# 4:     4     4
# 5:     5    NA

Here, unique can go inside or outside as.data.table(.), your choice.

CodePudding user response:

Trivially, since I am not interested in duplicates, I just had to remove them beforehand, after which code in this post works fine.

Remove with df <- df[!duplicated(df[c("Output","Input")]),]

  • Related