Home > other >  collapse dataframe into 4 columns while retaining information
collapse dataframe into 4 columns while retaining information

Time:02-04

I have a dataframe which looks like this (but >10k rows).

df <- data.frame(id=1:2,
                 feature1=1:2,
                 feature2=4:5,
                 feature3=7:8)

And want to reshape it looks something like this:

df_goal <- data.frame(value1=c(1,2,1,2,1,2),
                      value2=c(1,2,4,5,7,8),
                      x=c("feature1","feature1","feature1","feature1","feature1","feature1"),
                      y=c("feature1","feature1","feature2","feature2","feature3","feature3"))

df_goal presents only a subset of the final df that I hope to achieve. Ultimatly, I hope to have every combination of feature1, feature2 and feature3 and the corresponding values.

I want to use the resulting df for ggplot.

I think there should be a straightforward tidyr solution for my problem, but I cant seem to find it.

Any help is much appreciated!

CodePudding user response:

Here's a starting point: pivot the data set, then compute the combnations (unnest and asplit are just here to make it a dataframe in the end).

library(tidyr)
library(dplyr) #requires dplyr 1.1.0

data1 <- 
  pivot_longer(df, -id, names_to = "feature") %>% 
  reframe(across(c(value, feature), ~ asplit(t(combn(.x, 2)), 1)),
          .by = id) %>% 
  unnest_wider(c(value, feature), names_sep = "")

# A tibble: 6 × 5
#     id    value1    value2 feature1  feature2 
#1     1         1         4 feature1  feature2 
#2     1         1         7 feature1  feature3 
#3     1         4         7 feature2  feature3 
#4     2         2         5 feature1  feature2 
#5     2         2         8 feature1  feature3 
#6     2         5         8 feature2  feature3

To get the remaining rows, this could do it:

data2 <- 
  pivot_longer(df, -id, names_to = "feature") %>% 
  mutate(feature1 = feature, feature2 = feature,
         value1 = value, value2 = value, .keep = "unused")

# A tibble: 6 × 5
#     id feature1 feature2 value1 value2
#1     1 feature1 feature1      1      1
#2     1 feature2 feature2      4      4
#3     1 feature3 feature3      7      7
#4     2 feature1 feature1      2      2
#5     2 feature2 feature2      5      5
#6     2 feature3 feature3      8      8

Then, altogether:

bind_rows(data1, data2) %>% 
  arrange(id, feature1, feature2)

# A tibble: 12 × 5
      id    value1    value2 feature1  feature2 
 1     1         1         1 feature1  feature1 
 2     1         1         4 feature1  feature2 
 3     1         1         7 feature1  feature3 
 4     1         4         4 feature2  feature2 
 5     1         4         7 feature2  feature3 
 6     1         7         7 feature3  feature3 
 7     2         2         2 feature1  feature1 
 8     2         2         5 feature1  feature2 
 9     2         2         8 feature1  feature3 
10     2         5         5 feature2  feature2 
11     2         5         8 feature2  feature3 
12     2         8         8 feature3  feature3 

CodePudding user response:

Might be a case of pivoting, renaming and full_joining the dataframe to itself?

df <- data.frame(
  id = 1:2,
  feature1 = 1:2,
  feature2 = 4:5,
  feature3 = 7:8
)

library(tidyverse)

df_longer <- pivot_longer(df,-id)

df_longer |>
  rename(name1 = name, value1 = value) |>
  full_join(df_longer |> rename(name2 = name, value2 = value),
            by = "id") |> 
  select(id, value1, value2, name1, name2) |> 
  arrange(name1, name2)

#> # A tibble: 18 × 5
#>       id value1 value2 name1    name2   
#>    <int>  <int>  <int> <chr>    <chr>   
#>  1     1      1      1 feature1 feature1
#>  2     2      2      2 feature1 feature1
#>  3     1      1      4 feature1 feature2
#>  4     2      2      5 feature1 feature2
#>  5     1      1      7 feature1 feature3
#>  6     2      2      8 feature1 feature3
#>  7     1      4      1 feature2 feature1
#>  8     2      5      2 feature2 feature1
#>  9     1      4      4 feature2 feature2
#> 10     2      5      5 feature2 feature2
#> 11     1      4      7 feature2 feature3
#> 12     2      5      8 feature2 feature3
#> 13     1      7      1 feature3 feature1
#> 14     2      8      2 feature3 feature1
#> 15     1      7      4 feature3 feature2
#> 16     2      8      5 feature3 feature2
#> 17     1      7      7 feature3 feature3
#> 18     2      8      8 feature3 feature3

This seems to match the six rows of the output dataset you've provided above - is this what you're looking for?

  • Related