Home > Software engineering >  Group Columns Based on Values in Second Table
Group Columns Based on Values in Second Table

Time:05-17

I have a df with 6 columns and >5000 rows. I need to group the columns based on information in a second table (samples), and then get the mean values for each group and place into a new dataframe.

Column names will not always be the same or structured as shown below: it is necessary to group based on the values in the second table.

I've searched the forums but I don't know the terminology for what I'm trying to accomplish and have come up empty handed.

Thanks for any help!

>head(df,3)
|        | Control_Rep1 | Ethanol_Rep1 | Control_Rep2 | Ethanol_Rep2 | Control_Rep3 | Ethanol_Rep3 |
|--------|--------------|--------------|--------------|--------------|--------------|--------------|
| Q0120  | 22           | 29           | 25           | 39           | 13           | 23           |
| R0010W | 3694         | 6205         | 3322         | 7110         | 4985         | 10513        |
| R0020C | 3024         | 3564         | 2799         | 4191         | 5030         | 6214         |


>samples
| Identifier   | Treatment |
|--------------|-----------|
| Control_Rep1 | Control   |
| Ethanol_Rep1 | Ethanol   |
| Control_Rep2 | Control   |
| Ethanol_Rep2 | Ethanol   |
| Control_Rep3 | Control   |
| Ethanol_Rep3 | Ethanol   |


>Desired_Table
|        | Control    | Ethanol    |
|--------|------------|------------|
| Q0120  | 20         | 30.3333333 |
| R0010W | 4000.33333 | 7942.66667 |
| R0020C | 3617.66667 | 4656.33333 |

CodePudding user response:

We could split the 'df' based on the matching between the column names of 'df' and 'samples' 'Identifier' and then take the rowMeans

sapply(split.default(df,  samples$Treatment[match(names(df), 
    samples$Identifier)]), rowMeans, na.rm = TRUE)

-output

        Control    Ethanol
Q0120    20.000   30.33333
R0010W 4000.333 7942.66667
R0020C 3617.667 4656.33333

data

df <- structure(list(Control_Rep1 = c(22L, 3694L, 3024L), Ethanol_Rep1 = c(29L, 
6205L, 3564L), Control_Rep2 = c(25L, 3322L, 2799L), Ethanol_Rep2 = c(39L, 
7110L, 4191L), Control_Rep3 = c(13L, 4985L, 5030L), Ethanol_Rep3 = c(23L, 
10513L, 6214L)), class = "data.frame", row.names = c("Q0120", 
"R0010W", "R0020C"))

samples <- structure(list(Identifier = c("Control_Rep1", "Ethanol_Rep1", 
"Control_Rep2", "Ethanol_Rep2", "Control_Rep3", "Ethanol_Rep3"
), Treatment = c("Control", "Ethanol", "Control", "Ethanol", 
"Control", "Ethanol")), class = "data.frame", row.names = c(NA, 
-6L))
  • Related