I am new to R and struggling with grouping my dataset. This is an example of the data:
sample | profile |
---|---|
1 | A |
2 | A,B |
3 | A,B |
4 | A,C |
5 | C |
6 | A,C |
I am trying to group the profiles so that the same profiles are labelled as the same group:
sample | profile | profile group/cluster |
---|---|---|
genome 1 | A | 1 |
genome 2 | A,B | 2 |
genome 3 | A,B | 2 |
genome 4 | A,C | 3 |
genome 5 | C | 4 |
genome 6 | A,C | 3 |
Here, the two samples with profile A,B share one group number, and the two samples with profile A,C share another.
I have tried playing around with these packages:
library(tidyverse)
library(janitor)
library(stringr)
dupes <- get_dupes(database, profile)
dupes
ll_by_outcome <- as.data.frame(database %>%
  group_by(profile) %>%
  add_count())
ll_by_outcome
But these only find the duplicated profiles rather than assigning a group number to each one. I am not sure how to go about this. Any help is appreciated!
CodePudding user response:
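We could use `match()`: matching each profile against the vector of unique profiles returns the position of its first appearance, which works as a group ID.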
library(dplyr)
library(stringr)
df1 %>%
  mutate(group = match(profile, unique(profile)),
         sample = str_c('genome ', sample))
Output:
sample profile group
1 genome 1 A 1
2 genome 2 A,B 2
3 genome 3 A,B 2
4 genome 4 A,C 3
5 genome 5 C 4
6 genome 6 A,C 3
Data:
df1 <- structure(list(sample = 1:6, profile = c("A", "A,B", "A,B", "A,C",
"C", "A,C")), class = "data.frame", row.names = c(NA, -6L))
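For anyone not using dplyr, the same idea can be sketched in base R. This is only an illustration that rebuilds the `df1` example data with `data.frame()`; the `group` column name is taken from the answer above.

```r
# Example data, equivalent to df1 above
df1 <- data.frame(sample = 1:6,
                  profile = c("A", "A,B", "A,B", "A,C", "C", "A,C"),
                  stringsAsFactors = FALSE)

# unique() keeps the profiles in order of first appearance;
# match() returns the position of each profile in that vector
df1$group <- match(df1$profile, unique(df1$profile))

df1$group
# 1 2 2 3 4 3
```

Because the IDs come from `unique()`, they follow the order in which each profile first appears in the data.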
CodePudding user response:
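Does this work? Note that `cur_group_id()` numbers the groups in the sorted order of `profile`, which happens to match the order of first appearance in this data.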
library(dplyr)
library(stringr)  # needed for str_c()
df %>%
  mutate(sample = str_c('genome', sample, sep = ' ')) %>%
  group_by(profile) %>%
  mutate(cluster = cur_group_id())
# A tibble: 6 × 3
# Groups: profile [4]
sample profile cluster
<chr> <chr> <int>
1 genome 1 A 1
2 genome 2 A,B 2
3 genome 3 A,B 2
4 genome 4 A,C 3
5 genome 5 C 4
6 genome 6 A,C 3
CodePudding user response:
You can do it using factors: converting `profile` to a factor and then to numeric gives each distinct profile an integer code.
With the data from @akrun's answer:
df1 %>% mutate(cluster = as.numeric(factor(profile)))
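One caveat with the factor approach: by default, factor levels are sorted, so the IDs follow the sorted order of the profiles rather than the order in which they first appear. For the example data the two orders coincide. The sketch below uses a small made-up `profile` vector (not from the question) to show the difference and one possible way to get first-appearance numbering.

```r
# Hypothetical vector where sorted order differs from order of appearance
profile <- c("C", "A,B", "A", "A,B")

# Default levels are sorted, so "A" gets ID 1 even though "C" comes first
by_sorted <- as.integer(factor(profile))
by_sorted
# 3 2 1 2

# Setting the levels to unique(profile) numbers groups by first appearance
by_first <- as.integer(factor(profile, levels = unique(profile)))
by_first
# 1 2 3 2
```

Either numbering labels identical profiles with the same group; which one to prefer depends only on whether the ID order matters.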