I have a data.table of gene expression data with the top 5 genes from each of 44 clusters, as follows:
> cluster.top5gene
Cluster Genes aveLog2FC FDR
1: 1 Cd79a 5.125957 0.000000e+00
2: 1 Ly6d 3.918639 0.000000e+00
3: 1 Cd79b 3.532945 0.000000e+00
4: 1 Iglc2 3.523255 0.000000e+00
5: 1 Ebf1 3.322775 0.000000e+00
---
216: 44 Hba-a2 3.881978 4.074726e-31
217: 44 Hba-a1 3.892339 1.432746e-30
218: 44 Hbb-bs 3.971035 1.178994e-28
219: 44 Cd79a 2.629973 2.261226e-19
220: 44 Hbb-bt 3.139013 1.221915e-17
> str(cluster.top5gene)
Classes ‘data.table’ and 'data.frame': 220 obs. of 4 variables:
$ Cluster : int 1 1 1 1 1 2 2 2 2 2 ...
$ Genes : chr "Cd79a" "Ly6d" "Cd79b" "Iglc2" ...
$ aveLog2FC: num 5.13 3.92 3.53 3.52 3.32 ...
$ FDR : num 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "index")= int(0)
..- attr(*, "__Genes")= int [1:220] 185 212 161 52 86 120 135 110 103 56 ...
Some gene names are duplicated (in the Genes column):
> cluster.top5gene[duplicated(cluster.top5gene, by="Genes"), Genes]
[1] "Ptprb" "Gsn" "C1qa" "C1qb" "C1qc" "Apoe" "Nkg7" "Ccl5" "Cd3g" "Gsn" "Mgp" "Nkg7" "C1qa"
[14] "C1qb" "Ccl8" "C1qc" "Apoe" "Car4" "Kdr" "Icam2" "Emp2" "Ly6c1" "Car4" "Ybx1" "Sftpc" "Sftpa1"
[27] "Cxcl15" "Sftpb" "F13a1" "Cxcl2" "Ptprb" "Tm4sf1" "Hba-a2" "Hba-a1" "Hbb-bs" "Cd79a" "Hbb-bt"
And their corresponding row numbers;
> cluster.top5gene[, .I[duplicated(Genes)]]
[1] 20 65 72 73 74 75 76 77 79 81 85 91 111 112 113 114 115 121 122 125 126 128 129 130 156 157 158 159 193
[30] 195 198 199 216 217 218 219 220
I made a list of the duplicated gene names and their corresponding Cluster numbers as follows:
cluster.top5gene[duplicated(Genes, fromLast=F, by=Genes), Cluster, Genes]
Genes Cluster
1: Ptprb 4
2: Ptprb 40
3: Gsn 13
4: Gsn 17
5: C1qa 15
6: C1qa 23
7: C1qb 15
8: C1qb 23
9: C1qc 15
10: C1qc 23
11: Apoe 15
12: Apoe 23
13: Nkg7 16
14: Nkg7 19
15: Ccl5 16
16: Cd3g 16
17: Mgp 17
18: Ccl8 23
19: Car4 25
20: Car4 26
21: Kdr 25
22: Icam2 25
23: Emp2 26
24: Ly6c1 26
25: Ybx1 26
26: Sftpc 32
27: Sftpa1 32
28: Cxcl15 32
29: Sftpb 32
30: F13a1 39
31: Cxcl2 39
32: Tm4sf1 40
33: Hba-a2 44
34: Hba-a1 44
35: Hbb-bs 44
36: Cd79a 44
37: Hbb-bt 44
Genes Cluster
As you can see, some genes are listed with more than one Cluster, but others are listed with only one, even though they do occur in more than one Cluster, as in the following example:
> cluster.top5gene[Genes=="Ccl5",]
Cluster Genes aveLog2FC FDR
1: 9 Ccl5 4.076985 0
2: 16 Ccl5 3.724350 0
I'd really appreciate any help on this issue.
Answer:
Again, without access to your data I may be on the wrong track, but if you want a list of the duplicated Genes and their clusters, it is perhaps better to just do this:
cluster.top5gene[, .SD[.N>1], by=Genes][, .(Genes, Cluster)]
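The likely reason the earlier attempt lists Ccl5 only under Cluster 16 is that duplicated() flags the second and later occurrences of a value, never the first. Here is a minimal sketch on a small made-up table (toy and its rows are hypothetical, only modelled on the question's columns) comparing that behaviour with the grouped expression above and with an alternative that scans from both ends:

```r
library(data.table)

# Hypothetical toy data with the same column names as in the question
toy <- data.table(
  Cluster = c(1L, 9L, 16L, 16L, 44L),
  Genes   = c("Cd79a", "Ccl5", "Ccl5", "Nkg7", "Cd79a")
)

# duplicated() marks only the second and later occurrences,
# so the first cluster of each repeated gene is dropped
later_only <- toy[duplicated(Genes), .(Genes, Cluster)]

# Grouped approach: keep every row of any gene that occurs more than once
by_group <- toy[, .SD[.N > 1], by = Genes][, .(Genes, Cluster)]

# Alternative: flag duplicates scanning from the start and from the end,
# which keeps the original row order
all_dups <- toy[duplicated(Genes) | duplicated(Genes, fromLast = TRUE),
                .(Genes, Cluster)]

print(later_only)
print(by_group)
print(all_dups)
```

The grouped version returns rows ordered by first appearance of each gene, while the two-sided duplicated() version preserves row order; both should contain the same set of gene and cluster pairs.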