I have an R data frame in R which looks like this:
data.frame(ID = c("1", "1", "1", "1", "2", "2","2","2","3","3","3","3"),
plist = c("CP1", "P2", "P3", "P4", "P1", "P2","P3","CP2","CP1","P2","P3","P5"))
Now, I want to create a matrix with counts based on the values present in plist as below:
P1 | P2 | P3 | P4 | P5 | |
---|---|---|---|---|---|
CP1 | 0 | 2 | 2 | 1 | 1 |
CP2 | 1 | 1 | 1 | 0 | 0 |
This means CP1 is present along with P2 in 2 samples, with P3 in 2 samples but with P5 in only 1 sample.
Any ideas how can I do this?
CodePudding user response:
Your data is very badly structured. Here is an idea though,
library(dplyr)
library(tidyr)
df %>%
arrange(ID, plist) %>%
group_by(ID) %>%
mutate(new = first(plist)) %>%
slice(-1L) %>%
arrange(plist)%>%
pivot_wider(names_from = plist, values_from = ID, values_fn = length, values_fill = 0) %>%
arrange(new)
# A tibble: 2 × 6
new P1 P2 P3 P4 P5
<chr> <int> <int> <int> <int> <int>
1 CP1 0 2 2 1 1
2 CP2 1 1 1 0 0
CodePudding user response:
Here is a solution with base R.
First, let's assume that the data-frame is called dat
:
dat <- data.frame(ID = c("1", "1", "1", "1", "2", "2","2","2","3","3","3","3"),
plist = c("CP1", "P2", "P3", "P4", "P1", "P2","P3","CP2","CP1","P2","P3","P5"))
Then create the "empty" count matrix
plist <- dat$plist
rowvec <- sort(unique(grep("^CP[0-9] $", plist, value = TRUE)))
colvec <- sort(unique(grep("^P[0-9] $", plist, value = TRUE)))
count.mat <- matrix(0, length(rowvec), length(colvec), dimnames = list(rowvec, colvec))
Then loop on the IDs:
for (id in unique(dat$ID)) {
rowids <- grep("^CP[0-9] $", dat$plist[dat$ID == id], value = TRUE)
colids <- grep("^P[0-9] $", dat$plist[dat$ID == id], value = TRUE)
count.mat[rowids, colids] <- count.mat[rowids, colids] 1
}
You might want to add some tests to make sure that there are no duplicate "CPi" or "Pi" per ID.
Here is what the result looks like
P1 P2 P3 P4 P5
CP1 0 2 2 1 1
CP2 1 1 1 0 0