Putting data into specific row order in R-CodePudding

I have inherited a mess of a table which looks a little like this:

   A1     B1     C1
V1 B1 0/1 C1 0/0 A1 1/1
V2 C1 0/1 A1 0/0 B1 0/1
V3 B1 0/0 NA     A1 0/1
etc

In reality there are many more columns and rows

I would like to shuffle each row around so they fall in the correct column, matching across rows to a specific column by name but not move the data vertically, only horizontally as each row is specific to the V number.The desired output would be:

   A1     B1     C1
V1 A1 1/1 B1 0/1 C1 0/0
V2 A1 0/0 B1 0/1 C1 0/1
V3 A1 0/1 B1 0/0 NA

How would one go about doing this in R? Is a link to a subset of the real data:

dput(t[1:5,21:27])
structure(list(`32-AA-0003` = c("32-PA-0002 0/1", "32-AA-0003 0/1", 
"32-PA-0006 0/1", "32-GP-0001 0/1", "32-AA-0003 1/1"), `32-DA-0007` = c("", 
"", "", "32-JA-0004 0/1", "32-DA-0007 1/1"), `32-GP-0001` = c("", 
"", "", "32-PA-0002 0/1", "32-GP-0001 0/1"), `32-JA-0004` = c("", 
"", "", "32-PA-0006 0/1", "32-JA-0004 0/1"), `32-MA-0005` = c("", 
"", "", "", "32-MA-0005 1/1"), `32-PA-0002` = c("", "", "", "", 
"32-PA-0002 0/1"), `32-PA-0006` = c("", "", "", "", "32-PA-0006 0/1"
)), row.names = c(NA, -5L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x557151b3b940>)

Many thanks

CodePudding user response：

You can do:

setNames(as.data.frame(t(apply(t, 1, function(x) {
  sapply(names(t), function(y) {
     z <- grep(y, x)
     if(length(z) == 0) '' else x[z]
    })
}))), names(t))
#>       32-AA-0003     32-DA-0007     32-GP-0001     32-JA-0004     32-MA-0005
#> 1                                                                           
#> 2 32-AA-0003 0/1                                                            
#> 3                                                                           
#> 4                               32-GP-0001 0/1 32-JA-0004 0/1               
#> 5 32-AA-0003 1/1 32-DA-0007 1/1 32-GP-0001 0/1 32-JA-0004 0/1 32-MA-0005 1/1
#>       32-PA-0002     32-PA-0006
#> 1 32-PA-0002 0/1               
#> 2                              
#> 3                32-PA-0006 0/1
#> 4 32-PA-0002 0/1 32-PA-0006 0/1
#> 5 32-PA-0002 0/1 32-PA-0006 0/1

Explanation

The way this works is as follows. If we take the first row in the data frame, we can find out which (if any) of the entries belong to the first column by seeing whether any of them contain the name of the first column. We use grep for this, so we can do

grep(names(t)[1], t[1, ])
#> integer(0)

Which tells us that none of the strings in the first row should be in the first column. Since we want an empty string rather than an integer(0) if there are no matches for a particular cell, we need to convert any empty matches to a ''. We can do this by writing

 z <- grep(names(t)[1], t[1, ])
 if(length(z) == 0) '' else x[z]

which will return the matched string if there is one, and an empty string otherwise.

We can do this for each column, so that we can find out which column each cell in the row belongs to. We could write this as a loop:

result <- character(0)

for(i in 1:length(names(t))) {
  z <- grep(names(t)[i], t[1, ])
  result[i] <- if(length(z) == 0) '' else t[1, z]
}

result
#> [1] ""               ""               ""               ""              
#> [5] ""               "32-PA-0002 0/1" ""

But a more concise way to do this in R is to use sapply, so that we don't need to use a storage vector (result), and don't need to use the i to index:

sapply(names(t), function(y) {
  z <- grep(y, t[1, ])
  if(length(z) == 0) '' else t[1, z]
})

Now the problem is that this only gives us the result for a single row of our data frame, but we want a result for each row. We can use apply to specify that we want to pass one row at a time, and this means we don't need to loop through each t[1,], t[2,] etc individually. We just need to specify the data frame t, and the margin we want to operate on (1, i.e. row-wise):

apply(t, 1, function(x) {
  sapply(names(t), function(y) {
     z <- grep(y, x)
     if(length(z) == 0) '' else x[z]
    })
})
#>      [,1]             [,2]             [,3]             [,4]            
#> [1,] ""               "32-AA-0003 0/1" ""               ""              
#> [2,] ""               ""               ""               ""              
#> [3,] ""               ""               ""               "32-GP-0001 0/1"
#> [4,] ""               ""               ""               "32-JA-0004 0/1"
#> [5,] ""               ""               ""               ""              
#> [6,] "32-PA-0002 0/1" ""               ""               "32-PA-0002 0/1"
#> [7,] ""               ""               "32-PA-0006 0/1" "32-PA-0006 0/1"
#>      [,5]            
#> [1,] "32-AA-0003 1/1"
#> [2,] "32-DA-0007 1/1"
#> [3,] "32-GP-0001 0/1"
#> [4,] "32-JA-0004 0/1"
#> [5,] "32-MA-0005 1/1"
#> [6,] "32-PA-0002 0/1"
#> [7,] "32-PA-0006 0/1"

Now we have a different problem, in that although we have a matrix with all the results, it is transposed. To reverse the transpose we use the function t (this is confusing because you chose t for the name of your data frame, which isn't a great idea for this very reason). Finally, we want to convert the transposed matrix into a data frame using as.data.frame, and use setNames to give this data frame the same names as our original data frame. So the final code is:

setNames(as.data.frame(t(apply(t, 1, function(x) {
  sapply(names(t), function(y) {
     z <- grep(y, x)
     if(length(z) == 0) '' else x[z]
    })
}))), names(t))