I have a CSV file that I downloaded from NCBI. In R, for each value that repeats in the "Locus" column,
I want to select the row with the largest value in the "Length" column.
For example, AGER appears several times in the Locus column. When I checked by hand, the longest AGER entry is in the 16th row, so that is the row I need.
CodePudding user response:
You can do this:
library(data.table)
# Sort by descending Length, then keep the first row within each Locus
fread("proteins_51_1820449.csv")[order(-Length), first(.SD), by = Locus]
Output:
Locus #Name Accession Start Stop Strand GeneID Locus tag Protein product Length Protein Name
1: TTN chromosome 2 NC_000002.12 178527012 178804642 - 7273 - NP_001254479.2 35991 titin isoform IC
2: MUC16 Un NW_025791807.1 83875 260124 - 94025 - NP_001388430.1 15349 mucin-16 precursor
3: OBSCN chromosome 1 NC_000001.11 228211784 228378795 + 84033 - NP_001373054.1 8925 obscurin isoform c
4: SYNE1 chromosome 6 NC_000006.12 152122436 152628331 - 23345 - XP_016866097.1 8846 nesprin-1 isoform X1
5: NEB chromosome 2 NC_000002.12 151485760 151733156 - 4703 - NP_001258137.2 8560 nebulin isoform 4
---
19995: HRURF chromosome 8 NC_000008.11 22130604 22130708 - 120766137 - NP_001381061.1 34 protein HRURF
19996: BLACAT1 chromosome 1 NC_000001.11 205440925 205441026 - 101669762 - NP_001384355.1 33 bladder cancer associated transcript 1
19997: SLN chromosome 11 NC_000011.10 107707835 107707930 - 6588 - NP_003054.1 31 sarcolipin
19998: LOC105372440 chromosome 19 NC_000019.10 50785812 50786104 - 105372440 - NP_001371526.1 28 uncharacterized protein LOC105372440
19999: RPL41 chromosome 12 NC_000012.12 56116788 56117524 + 6171 - NP_066927.1 25 60S ribosomal protein L41
If speed is important, data.table is a good choice here: fread() reads large files quickly, and the sort and grouping happen in a single call.
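To check the idiom on something small before running it on the full file, here is a minimal sketch with a made-up table. The `toy` data and its values are assumptions for illustration only and are not taken from the NCBI file.

```r
library(data.table)

# Toy data standing in for the NCBI table (values are invented)
toy <- data.table(
  Locus  = c("AGER", "AGER", "AGER", "TTN", "TTN"),
  Length = c(404L, 420L, 390L, 35991L, 34350L)
)

# Sort by descending Length, then keep the first row of each Locus group
longest <- toy[order(-Length), first(.SD), by = Locus]
print(longest)
```

Because the rows are sorted before grouping, the first row of each group is the one with the largest Length. If two rows tie on Length, whichever comes first after sorting is kept.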
CodePudding user response:
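Using the tidyverse, you could do:
library(tidyverse)
df <- read.csv(path_to_csv)  # path_to_csv: path to the downloaded CSV file
df %>%
  group_by(Locus) %>%
  slice_max(Length, n = 1) %>%  # longest row(s) per Locus; ties are kept
  slice_head(n = 1)             # keep a single row when there is a tie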
#> # A tibble: 19,999 x 11
#> # Groups: Locus [19,999]
#> X.Name Acces~1 Start Stop Strand GeneID Locus Locus~2 Prote~3 Length
#> <chr> <chr> <int> <int> <chr> <int> <chr> <chr> <chr> <int>
#> 1 chromosome ~ NC_000~ 5.83e7 5.84e7 - 1 A1BG - NP_570~ 495
#> 2 chromosome ~ NC_000~ 5.08e7 5.09e7 - 29974 A1CF - XP_047~ 602
#> 3 chromosome ~ NC_000~ 9.07e6 9.12e6 - 2 A2M - XP_006~ 1512
#> 4 chromosome ~ NC_000~ 8.82e6 8.88e6 + 144568 A2ML1 - XP_011~ 1467
#> 5 chromosome 1 NC_000~ 3.33e7 3.33e7 - 127550 A3GA~ - NP_001~ 340
#> 6 chromosome ~ NC_000~ 4.27e7 4.27e7 - 53947 A4GA~ - XP_016~ 353
#> 7 chromosome 3 NC_000~ 1.38e8 1.38e8 - 51146 A4GNT - XP_016~ 340
#> 8 chromosome ~ NC_000~ 5.33e7 5.33e7 - 8086 AAAS - NP_056~ 546
#> 9 chromosome ~ NC_000~ 1.25e8 1.25e8 + 65985 AACS - NP_076~ 672
#> 10 chromosome 3 NC_000~ 1.52e8 1.52e8 13 AADAC - NP_001~ 399
#> # ... with 19,989 more rows, 1 more variable: Protein.Name <chr>, and
#> # abbreviated variable names 1: Accession, 2: Locus.tag, 3: Protein.product
Created on 2022-09-25 with reprex v2.0.2
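As a possible variation, the two slice steps could be folded into one with the `with_ties = FALSE` argument of `slice_max()`. The sketch below uses a made-up `toy` tibble with a deliberate tie. The data and the `Accession` placeholder values are assumptions for illustration, not rows from the real file.

```r
library(dplyr)

# Toy data with a tie on Length for AGER (values are invented)
toy <- tibble(
  Locus     = c("AGER", "AGER", "AGER", "TTN", "TTN"),
  Accession = c("a1", "a2", "a3", "t1", "t2"),  # hypothetical IDs
  Length    = c(420L, 420L, 390L, 35991L, 34350L)
)

longest <- toy %>%
  group_by(Locus) %>%
  # with_ties = FALSE returns exactly one row per group, even on a tie
  slice_max(Length, n = 1, with_ties = FALSE) %>%
  ungroup()

print(longest)
```

This should give one row per Locus, the same as `slice_max()` followed by `slice_head(n = 1)`. The final `ungroup()` is optional and only drops the grouping from the result.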