I have the results of two clusterings and would like to create vectors so that all features that belong to a cluster are listed in one vector.
The following dataframe results from the clustering. The columns C1 and C2 hold the cluster assignments from two different algorithms.
| A1 | A2 | A3 | A4 | A5 | C1 | C2 |
| -- | -- | -- | -- | -- | -- | -- |
| 0 | 0 | 0 | 15 | 0 | 1 | 1 |
| 0 | 20 | 34 | 0 | 0 | 2 | 2 |
| 33 | 0 | 0 | 7 | 0 | 1 | 1 |
| 0 | 0 | 0 | 0 | 85 | 3 | 2 |
| 0 | 0 | 0 | 0 | 94 | 3 | 2 |
| 0 | 12 | 57 | 0 | 0 | 2 | 2 |
I want to create one vector for each cluster so that at the end I have
c11 = ['A1','A4']
c12 = ['A2','A3']
c13 = ['A5']
c21 = ['A1','A4']
c22 = ['A2','A3', 'A5']
EDIT: To be more specific, the code should build one vector per cluster as follows: if a feature has a nonzero value in any of the rows belonging to the cluster, add that feature to the cluster's vector.
For the second clustering, the algorithm first looks at cluster C21 (rows 1 and 3). According to these rows, the features A4 and A1 can be positive in instances of that cluster. In the second step, the algorithm looks at rows 2, 4, 5 and 6 for C22. There, A2 and A3 can be positive (according to rows 2 and 6), and so can A5 (according to rows 4 and 5).
CodePudding user response:
Create a list of column names for each row where the value is not equal to 0 by looping across the rows with `apply` and `MARGIN = 1`. Use the columns 'C1' and 'C2' to `split` the list, loop over the outer list and `unlist` the inner list elements, then get the `unique` values and `sort` them.
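# For each row, collect the names of the feature columns (A1..A5) that are nonzero.
# apply() returns a list here because the rows have different numbers of nonzero features.
l1 <- apply(df1[1:5] != 0, 1, FUN = function(x) names(x)[x])
# Group the per-row name vectors by cluster label, then flatten, de-duplicate and sort
lst1 <- lapply(split(l1, df1$C1), function(x) sort(unique(unlist(x))))
lst2 <- lapply(split(l1, df1$C2), function(x) sort(unique(unlist(x))))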
Output:
> lst1
$`1`
[1] "A1" "A4"
$`2`
[1] "A2" "A3"
$`3`
[1] "A5"
> lst2
$`1`
[1] "A1" "A4"
$`2`
[1] "A2" "A3" "A5"
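For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the example table as `df1` (the name used in the code above) and applies the same rule with a slightly different technique: it splits the feature columns by cluster and keeps the columns whose nonzero count is positive. The helper name `features_by_cluster` and the variable `feature_cols` are made up for illustration and are not part of the original answer. This variant might be preferable when every row has the same number of nonzero features, because in that case `apply()` simplifies its result to a matrix or vector instead of a list.

```r
# Rebuild the example data from the question
df1 <- data.frame(
  A1 = c(0, 0, 33, 0, 0, 0),
  A2 = c(0, 20, 0, 0, 0, 12),
  A3 = c(0, 34, 0, 0, 0, 57),
  A4 = c(15, 0, 7, 0, 0, 0),
  A5 = c(0, 0, 0, 85, 94, 0),
  C1 = c(1, 2, 1, 3, 3, 2),
  C2 = c(1, 2, 1, 2, 2, 2)
)

# Hypothetical helper: for each cluster label, return the feature columns
# that are nonzero in at least one row of that cluster
features_by_cluster <- function(data, features, cluster) {
  lapply(split(data[features], data[[cluster]]), function(rows) {
    # colSums() counts the nonzero entries per feature within the cluster
    features[colSums(rows != 0) > 0]
  })
}

feature_cols <- paste0("A", 1:5)
lst1 <- features_by_cluster(df1, feature_cols, "C1")
lst2 <- features_by_cluster(df1, feature_cols, "C2")
```

The results are named lists keyed by cluster label, so the vector the question calls c22 would be `lst2[["2"]]`. Features come back in column order rather than sorted, which is the same order for this data.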