Data.Table R: a list of duplicated rows does not consistently show row duplications

Time:06-01

I have a data.table of gene expression data holding the top 5 genes from each of 44 clusters, as follows:

> cluster.top5gene
     Cluster  Genes aveLog2FC          FDR
  1:       1  Cd79a  5.125957 0.000000e+00
  2:       1   Ly6d  3.918639 0.000000e+00
  3:       1  Cd79b  3.532945 0.000000e+00
  4:       1  Iglc2  3.523255 0.000000e+00
  5:       1   Ebf1  3.322775 0.000000e+00
 ---                                      
216:      44 Hba-a2  3.881978 4.074726e-31
217:      44 Hba-a1  3.892339 1.432746e-30
218:      44 Hbb-bs  3.971035 1.178994e-28
219:      44  Cd79a  2.629973 2.261226e-19
220:      44 Hbb-bt  3.139013 1.221915e-17

> str(cluster.top5gene)
Classes ‘data.table’ and 'data.frame':  220 obs. of  4 variables:
 $ Cluster  : int  1 1 1 1 1 2 2 2 2 2 ...
 $ Genes    : chr  "Cd79a" "Ly6d" "Cd79b" "Iglc2" ...
 $ aveLog2FC: num  5.13 3.92 3.53 3.52 3.32 ...
 $ FDR      : num  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "index")= int(0) 
  ..- attr(*, "__Genes")= int [1:220] 185 212 161 52 86 120 135 110 103 56 ...

There are duplicated gene names (in the Genes column):

> cluster.top5gene[duplicated(cluster.top5gene, by="Genes"), Genes]
 [1] "Ptprb"  "Gsn"    "C1qa"   "C1qb"   "C1qc"   "Apoe"   "Nkg7"   "Ccl5"   "Cd3g"   "Gsn"    "Mgp"    "Nkg7"   "C1qa"  
[14] "C1qb"   "Ccl8"   "C1qc"   "Apoe"   "Car4"   "Kdr"    "Icam2"  "Emp2"   "Ly6c1"  "Car4"   "Ybx1"   "Sftpc"  "Sftpa1"
[27] "Cxcl15" "Sftpb"  "F13a1"  "Cxcl2"  "Ptprb"  "Tm4sf1" "Hba-a2" "Hba-a1" "Hbb-bs" "Cd79a"  "Hbb-bt"

And these are their corresponding row numbers:

> cluster.top5gene[, .I[duplicated(Genes)]]
 [1]  20  65  72  73  74  75  76  77  79  81  85  91 111 112 113 114 115 121 122 125 126 128 129 130 156 157 158 159 193
[30] 195 198 199 216 217 218 219 220

I then made a list of the duplicated gene names and their corresponding Cluster numbers, as follows:

> cluster.top5gene[duplicated(Genes, fromLast=F, by=Genes), Cluster, Genes]
     Genes Cluster
 1:  Ptprb       4
 2:  Ptprb      40
 3:    Gsn      13
 4:    Gsn      17
 5:   C1qa      15
 6:   C1qa      23
 7:   C1qb      15
 8:   C1qb      23
 9:   C1qc      15
10:   C1qc      23
11:   Apoe      15
12:   Apoe      23
13:   Nkg7      16
14:   Nkg7      19
15:   Ccl5      16
16:   Cd3g      16
17:    Mgp      17
18:   Ccl8      23
19:   Car4      25
20:   Car4      26
21:    Kdr      25
22:  Icam2      25
23:   Emp2      26
24:  Ly6c1      26
25:   Ybx1      26
26:  Sftpc      32
27: Sftpa1      32
28: Cxcl15      32
29:  Sftpb      32
30:  F13a1      39
31:  Cxcl2      39
32: Tm4sf1      40
33: Hba-a2      44
34: Hba-a1      44
35: Hbb-bs      44
36:  Cd79a      44
37: Hbb-bt      44
     Genes Cluster

As you can see, some genes are listed with more than one Cluster while others are listed with only one, even though those genes also occur in more than one Cluster, as in the following example:

> cluster.top5gene[Genes=="Ccl5",]
   Cluster Genes aveLog2FC FDR
1:       9  Ccl5  4.076985   0
2:      16  Ccl5  3.724350   0

I'd really appreciate any help on this issue.

CodePudding user response:

Again, without access to your data, I may be on the wrong track, but if you want a list of duplicate Genes and their clusters, perhaps better to just do this:

cluster.top5gene[, .SD[.N>1], by=Genes][, .(Genes, Cluster)]
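
As for why the original list looks inconsistent: duplicated() returns FALSE for the first occurrence of a value and TRUE only for the later repeats. A gene that occurs twice (like Ccl5 in clusters 9 and 16) therefore shows up once, and a gene that occurs three times shows up twice. Here is a minimal sketch of that behaviour; the `toy` table and its values are made up for illustration and only stand in for cluster.top5gene:

```r
library(data.table)

# Hypothetical toy data (not the real cluster.top5gene)
toy <- data.table(
  Cluster = c(1L, 4L, 9L, 16L, 20L, 40L),
  Genes   = c("Cd79a", "Ptprb", "Ccl5", "Ccl5", "Ptprb", "Ptprb")
)

# duplicated() flags only the second and later occurrences
toy[, duplicated(Genes)]

# So filtering on it drops the first copy of every repeated gene
only_repeats <- toy[duplicated(Genes), .(Genes, Cluster)]

# Scanning from both ends flags every copy of a repeated gene
all_copies <- toy[duplicated(Genes) | duplicated(Genes, fromLast = TRUE),
                  .(Genes, Cluster)]

# The grouped approach above keeps the same rows, ordered by gene
by_group <- toy[, .SD[.N > 1], by = Genes][, .(Genes, Cluster)]
```

In this toy case `only_repeats` misses Ccl5 in cluster 9 and Ptprb in cluster 4, while `all_copies` and `by_group` both keep all five rows that belong to a repeated gene.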