How can I choose the longest entry for each locus?

I have a CSV file that I got from NCBI. In R, I want to select, for each repeated value in the "Locus" column, the row with the largest value in the "Length" column.

For example, AGER is repeated in the Locus column. When I checked, the longest AGER entry is in the 16th row, so that is the row I need to get.


CodePudding user response：

You can do this:

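library(data.table)
# Sort by Length (descending), then keep the first row of each Locus group,
# which is the longest entry for that locus
fread("proteins_51_1820449.csv")[order(-Length), first(.SD), by = Locus]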

Output:

              Locus         #Name      Accession     Start      Stop Strand    GeneID Locus tag Protein product Length                           Protein Name
    1:          TTN  chromosome 2   NC_000002.12 178527012 178804642      -      7273         -  NP_001254479.2  35991                       titin isoform IC
    2:        MUC16            Un NW_025791807.1     83875    260124      -     94025         -  NP_001388430.1  15349                     mucin-16 precursor
    3:        OBSCN  chromosome 1   NC_000001.11 228211784 228378795            84033         -  NP_001373054.1   8925                     obscurin isoform c
    4:        SYNE1  chromosome 6   NC_000006.12 152122436 152628331      -     23345         -  XP_016866097.1   8846                   nesprin-1 isoform X1
    5:          NEB  chromosome 2   NC_000002.12 151485760 151733156      -      4703         -  NP_001258137.2   8560                      nebulin isoform 4
   ---                                                                                                                                                       
19995:        HRURF  chromosome 8   NC_000008.11  22130604  22130708      - 120766137         -  NP_001381061.1     34                          protein HRURF
19996:      BLACAT1  chromosome 1   NC_000001.11 205440925 205441026      - 101669762         -  NP_001384355.1     33 bladder cancer associated transcript 1
19997:          SLN chromosome 11   NC_000011.10 107707835 107707930      -      6588         -     NP_003054.1     31                             sarcolipin
19998: LOC105372440 chromosome 19   NC_000019.10  50785812  50786104      - 105372440         -  NP_001371526.1     28   uncharacterized protein LOC105372440
19999:        RPL41 chromosome 12   NC_000012.12  56116788  56117524             6171         -     NP_066927.1     25              60S ribosomal protein L41

If speed is important, this data.table approach should help, since both fread() and the grouped selection are fast on large files.
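
To see what the one-liner does, here is a minimal sketch on a small made-up table (the rows and values are invented for illustration and do not come from the NCBI file). Sorting by Length in descending order puts the longest entry of each locus first, so taking the first row per group keeps it:

library(data.table)

# Toy data, invented for illustration only
dt <- data.table(
  Locus  = c("AGER", "AGER", "TTN", "AGER", "TTN"),
  Length = c(404L, 420L, 35991L, 390L, 34350L),
  Name   = c("a", "b", "c", "d", "e")
)

# Sort by Length descending, then take the first row of each Locus group
longest <- dt[order(-Length), first(.SD), by = Locus]
print(longest)

If two rows of a locus share the maximum Length, this keeps only one of them.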

CodePudding user response：

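Using the tidyverse, you could do:

library(tidyverse)

# path_to_csv is a placeholder for the path to your CSV file
df <- read.csv(path_to_csv)

df %>%
  group_by(Locus) %>%
  # keep the row(s) with the largest Length in each Locus
  slice_max(Length, n = 1) %>%
  # if several rows tie for the maximum, keep only the first
  slice_head(n = 1)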
#> # A tibble: 19,999 x 11
#> # Groups:   Locus [19,999]
#>    X.Name       Acces~1  Start   Stop Strand GeneID Locus Locus~2 Prote~3 Length
#>    <chr>        <chr>    <int>  <int> <chr>   <int> <chr> <chr>   <chr>    <int>
#>  1 chromosome ~ NC_000~ 5.83e7 5.84e7 -           1 A1BG  -       NP_570~    495
#>  2 chromosome ~ NC_000~ 5.08e7 5.09e7 -       29974 A1CF  -       XP_047~    602
#>  3 chromosome ~ NC_000~ 9.07e6 9.12e6 -           2 A2M   -       XP_006~   1512
#>  4 chromosome ~ NC_000~ 8.82e6 8.88e6        144568 A2ML1 -       XP_011~   1467
#>  5 chromosome 1 NC_000~ 3.33e7 3.33e7 -      127550 A3GA~ -       NP_001~    340
#>  6 chromosome ~ NC_000~ 4.27e7 4.27e7 -       53947 A4GA~ -       XP_016~    353
#>  7 chromosome 3 NC_000~ 1.38e8 1.38e8 -       51146 A4GNT -       XP_016~    340
#>  8 chromosome ~ NC_000~ 5.33e7 5.33e7 -        8086 AAAS  -       NP_056~    546
#>  9 chromosome ~ NC_000~ 1.25e8 1.25e8         65985 AACS  -       NP_076~    672
#> 10 chromosome 3 NC_000~ 1.52e8 1.52e8            13 AADAC -       NP_001~    399
#> # ... with 19,989 more rows, 1 more variable: Protein.Name <chr>, and
#> #   abbreviated variable names 1: Accession, 2: Locus.tag, 3: Protein.product

Created on 2022-09-25 with reprex v2.0.2
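
The slice_head(n = 1) step is there because slice_max() keeps all tied rows by default. As a sketch on invented toy data (not the NCBI file), the same selection can probably be written in one step with with_ties = FALSE:

library(dplyr)

# Toy data, invented for illustration; two AGER rows tie at Length 420
df <- data.frame(
  Locus  = c("AGER", "AGER", "AGER", "TTN"),
  Length = c(420L, 420L, 390L, 35991L),
  Name   = c("a", "b", "c", "d")
)

longest <- df %>%
  group_by(Locus) %>%
  # with_ties = FALSE returns exactly one row per group, even when rows tie
  slice_max(Length, n = 1, with_ties = FALSE) %>%
  ungroup()

print(longest)

The ungroup() call at the end is optional. It just returns a plain tibble so that later steps are not silently applied per locus.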