I have inherited a mess of a table which looks a little like this:
A1 B1 C1
V1 B1 0/1 C1 0/0 A1 1/1
V2 C1 0/1 A1 0/0 B1 0/1
V3 B1 0/0 NA A1 0/1
etc
In reality there are many more columns and rows
I would like to shuffle each row around so they fall in the correct column, matching across rows to a specific column by name but not move the data vertically, only horizontally as each row is specific to the V number.The desired output would be:
A1 B1 C1
V1 A1 1/1 B1 0/1 C1 0/0
V2 A1 0/0 B1 0/1 C1 0/1
V3 A1 0/1 B1 0/0 NA
How would one go about doing this in R? Is a link to a subset of the real data:
dput(t[1:5,21:27])
structure(list(`32-AA-0003` = c("32-PA-0002 0/1", "32-AA-0003 0/1",
"32-PA-0006 0/1", "32-GP-0001 0/1", "32-AA-0003 1/1"), `32-DA-0007` = c("",
"", "", "32-JA-0004 0/1", "32-DA-0007 1/1"), `32-GP-0001` = c("",
"", "", "32-PA-0002 0/1", "32-GP-0001 0/1"), `32-JA-0004` = c("",
"", "", "32-PA-0006 0/1", "32-JA-0004 0/1"), `32-MA-0005` = c("",
"", "", "", "32-MA-0005 1/1"), `32-PA-0002` = c("", "", "", "",
"32-PA-0002 0/1"), `32-PA-0006` = c("", "", "", "", "32-PA-0006 0/1"
)), row.names = c(NA, -5L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x557151b3b940>)
Many thanks
CodePudding user response:
You can do:
setNames(as.data.frame(t(apply(t, 1, function(x) {
sapply(names(t), function(y) {
z <- grep(y, x)
if(length(z) == 0) '' else x[z]
})
}))), names(t))
#> 32-AA-0003 32-DA-0007 32-GP-0001 32-JA-0004 32-MA-0005
#> 1
#> 2 32-AA-0003 0/1
#> 3
#> 4 32-GP-0001 0/1 32-JA-0004 0/1
#> 5 32-AA-0003 1/1 32-DA-0007 1/1 32-GP-0001 0/1 32-JA-0004 0/1 32-MA-0005 1/1
#> 32-PA-0002 32-PA-0006
#> 1 32-PA-0002 0/1
#> 2
#> 3 32-PA-0006 0/1
#> 4 32-PA-0002 0/1 32-PA-0006 0/1
#> 5 32-PA-0002 0/1 32-PA-0006 0/1
Explanation
The way this works is as follows. If we take the first row in the data frame, we can find out which (if any) of the entries belong to the first column by seeing whether any of them contain the name
of the first column. We use grep
for this, so we can do
grep(names(t)[1], t[1, ])
#> integer(0)
Which tells us that none of the strings in the first row should be in the first column. Since we want an empty string rather than an integer(0)
if there are no matches for a particular cell, we need to convert any empty matches to a ''
. We can do this by writing
z <- grep(names(t)[1], t[1, ])
if(length(z) == 0) '' else x[z]
which will return the matched string if there is one, and an empty string otherwise.
We can do this for each column, so that we can find out which column each cell in the row belongs to. We could write this as a loop:
result <- character(0)
for(i in 1:length(names(t))) {
z <- grep(names(t)[i], t[1, ])
result[i] <- if(length(z) == 0) '' else t[1, z]
}
result
#> [1] "" "" "" ""
#> [5] "" "32-PA-0002 0/1" ""
But a more concise way to do this in R is to use sapply
, so that we don't need to use a storage vector (result
), and don't need to use the i
to index:
sapply(names(t), function(y) {
z <- grep(y, t[1, ])
if(length(z) == 0) '' else t[1, z]
})
Now the problem is that this only gives us the result for a single row of our data frame, but we want a result for each row. We can use apply
to specify that we want to pass one row at a time, and this means we don't need to loop through each t[1,]
, t[2,]
etc individually. We just need to specify the data frame t
, and the margin we want to operate on (1, i.e. row-wise):
apply(t, 1, function(x) {
sapply(names(t), function(y) {
z <- grep(y, x)
if(length(z) == 0) '' else x[z]
})
})
#> [,1] [,2] [,3] [,4]
#> [1,] "" "32-AA-0003 0/1" "" ""
#> [2,] "" "" "" ""
#> [3,] "" "" "" "32-GP-0001 0/1"
#> [4,] "" "" "" "32-JA-0004 0/1"
#> [5,] "" "" "" ""
#> [6,] "32-PA-0002 0/1" "" "" "32-PA-0002 0/1"
#> [7,] "" "" "32-PA-0006 0/1" "32-PA-0006 0/1"
#> [,5]
#> [1,] "32-AA-0003 1/1"
#> [2,] "32-DA-0007 1/1"
#> [3,] "32-GP-0001 0/1"
#> [4,] "32-JA-0004 0/1"
#> [5,] "32-MA-0005 1/1"
#> [6,] "32-PA-0002 0/1"
#> [7,] "32-PA-0006 0/1"
Now we have a different problem, in that although we have a matrix with all the results, it is transposed. To reverse the transpose we use the function t
(this is confusing because you chose t
for the name of your data frame, which isn't a great idea for this very reason). Finally, we want to convert the transposed matrix into a data frame using as.data.frame
, and use setNames
to give this data frame the same names as our original data frame. So the final code is:
setNames(as.data.frame(t(apply(t, 1, function(x) {
sapply(names(t), function(y) {
z <- grep(y, x)
if(length(z) == 0) '' else x[z]
})
}))), names(t))