Merge multiple tables (identical headers) within a text file

Time:12-03

Say I have more than 200 files, each structured as depicted below:

# Peptide length 11
# Rank Threshold for Strong binding peptides   0.500
# Rank Threshold for Weak binding peptides   2.000
-----------------------------------------------------------------------------------
  pos          HLA      peptide         Core Offset  I_pos  I_len  D_pos  D_len        iCore        Identity 1-log50k(aff) Affinity(nM)    %Rank  BindLevel
-----------------------------------------------------------------------------------
    0    HLA-B4402  GSHDLGIILQK    GSHDLGIIL      0      0      0      0      0    GSHDLGIIL NM_000094_3_COL         0.015     42580.79    90.00
    1    HLA-B4402  SHDLGIILQKI    SLGIILQKI      0      0      0      1      2  SHDLGIILQKI NM_000094_3_COL         0.024     38731.55    65.00
    2    HLA-B4402  HDLGIILQKIR    HDLIILQKI      0      0      0      3      1   HDLGIILQKI NM_000094_3_COL         0.024     38400.24    65.00
    3    HLA-B4402  DLGIILQKIRD    DLGIILQKI      0      0      0      0      0    DLGIILQKI NM_000094_3_COL         0.011     44267.78    95.00
    4    HLA-B4402  LGIILQKIRDM    LGIILQRDM      0      0      0      6      2  LGIILQKIRDM NM_000094_3_COL         0.024     38411.46    65.00
    5    HLA-B4402  GIILQKIRDMP    GIILQIRDM      0      0      0      5      1   GIILQKIRDM NM_000094_3_COL         0.017     41463.75    80.00
    6    HLA-B4402  IILQKIRDMPY    IILQKIRDY      0      0      0      8      2  IILQKIRDMPY NM_000094_3_COL         0.025     38152.18    65.00
    7    HLA-B4402  ILQKIRDMPYM    ILQKIRMPY      0      0      0      6      1   ILQKIRDMPY NM_000094_3_COL         0.025     37993.98    60.00
    8    HLA-B4402  LQKIRDMPYMD    QKIRDMPYM      1      0      0      0      0    QKIRDMPYM NM_000094_3_COL         0.015     42595.54    90.00
    9    HLA-B4402  QKIRDMPYMDP    QKIRDMPYM      0      0      0      0      0    QKIRDMPYM NM_000094_3_COL         0.017     41645.82    85.00
   10    HLA-B4402  KIRDMPYMDPS    KDMPYMDPS      0      0      0      1      2  KIRDMPYMDPS NM_000094_3_COL         0.023     39039.53    70.00
   11    HLA-B4402  IRDMPYMDPSX    RDMPYMPSX      1      0      0      6      1   RDMPYMDPSX NM_000094_3_COL         0.036     33871.57    41.00
-----------------------------------------------------------------------------------

Protein NM_000094_3_COL. Allele HLA-B4402. Number of high binders 0. Number of weak binders 0. Number of peptides 12

-----------------------------------------------------------------------------------
# Rank Threshold for Strong binding peptides   0.500
# Rank Threshold for Weak binding peptides   2.000
-----------------------------------------------------------------------------------
  pos          HLA      peptide         Core Offset  I_pos  I_len  D_pos  D_len        iCore        Identity 1-log50k(aff) Affinity(nM)    %Rank  BindLevel
-----------------------------------------------------------------------------------
    0    HLA-B4402  PVTGYKVQYTS    TGYKVQYTS      2      0      0      0      0    TGYKVQYTS NM_000094_3_COL         0.011     44190.25    95.00
    1    HLA-B4402  VTGYKVQYTSL    VTGYQYTSL      0      0      0      4      2  VTGYKVQYTSL NM_000094_3_COL         0.020     40061.36    75.00
    2    HLA-B4402  TGYKVQYTSLT    TGYKVYTSL      0      0      0      5      1   TGYKVQYTSL NM_000094_3_COL         0.020     40487.08    75.00
    3    HLA-B4402  GYKVQYTSLTG    YVQYTSLTG      1      0      0      1      1   YKVQYTSLTG NM_000094_3_COL         0.017     41521.20    80.00
    4    HLA-B4402  YKVQYTSLTGL    YQYTSLTGL      0      0      0      1      2  YKVQYTSLTGL NM_000094_3_COL         0.031     35710.76    49.00
    5    HLA-B4402  KVQYTSLTGLG    KVQYTSLTL      0      0      0      8      1   KVQYTSLTGL NM_000094_3_COL         0.029     36392.20    55.00
    6    HLA-B4402  VQYTSLTGLGQ    VQYTSLTGL      0      0      0      0      0    VQYTSLTGL NM_000094_3_COL         0.016     42180.50    85.00
    7    HLA-B4402  QYTSLTGLGQP    QYTSLTGLG      0      0      0      0      0    QYTSLTGLG NM_000094_3_COL         0.011     44293.17    95.00
    8    HLA-B4402  YTSLTGLGQPL    YTSLLGQPL      0      0      0      4      2  YTSLTGLGQPL NM_000094_3_COL         0.034     34547.04    44.00
    9    HLA-B4402  TSLTGLGQPLP    SLTGLGQPL      1      0      0      0      0    SLTGLGQPL NM_000094_3_COL         0.024     38475.10    65.00
   10    HLA-B4402  SLTGLGQPLPS    SLTGLGQPL      0      0      0      0      0    SLTGLGQPL NM_000094_3_COL         0.026     37575.76    60.00
   11    HLA-B4402  LTGLGQPLPSX    LLGQPLPSX      0      0      0      1      2  LTGLGQPLPSX NM_000094_3_COL         0.014     42874.84    90.00
-----------------------------------------------------------------------------------

Protein NM_000094_3_COL. Allele HLA-B4402. Number of high binders 0. Number of weak binders 0. Number of peptides 12

-----------------------------------------------------------------------------------
# Rank Threshold for Strong binding peptides   0.500
# Rank Threshold for Weak binding peptides   2.000
-----------------------------------------------------------------------------------
  pos          HLA      peptide         Core Offset  I_pos  I_len  D_pos  D_len        iCore        Identity 1-log50k(aff) Affinity(nM)    %Rank  BindLevel
-----------------------------------------------------------------------------------
    0    HLA-B4402  FLRLLDLAQEE    RLLDLAQEE      2      0      0      0      0    RLLDLAQEE NM_000106_5_CYP         0.014     42841.45    90.00
    1    HLA-B4402  LRLLDLAQEEL    RLLDLAQEL      1      0      0      7      1   RLLDLAQEEL NM_000106_5_CYP         0.029     36648.25    55.00
    2    HLA-B4402  RLLDLAQEELK    RLLDLAQEL      0      0      0      7      1   RLLDLAQEEL NM_000106_5_CYP         0.029     36350.87    55.00
    3    HLA-B4402  LLDLAQEELKE    LLDLAQEEL      0      0      0      0      0    LLDLAQEEL NM_000106_5_CYP         0.013     43487.79    95.00
    4    HLA-B4402  LDLAQEELKEE    LDQEELKEE      0      0      0      2      2  LDLAQEELKEE NM_000106_5_CYP         0.008     45629.40    99.00
    5    HLA-B4402  DLAQEELKEES    AQEELKEES      2      0      0      0      0    AQEELKEES NM_000106_5_CYP         0.009     45287.57    99.00
    6    HLA-B4402  LAQEELKEESG    AEELKEESG      1      0      0      1      1   AQEELKEESG NM_000106_5_CYP         0.013     43568.32    95.00
    7    HLA-B4402  AQEELKEESGF    AELKEESGF      0      0      0      1      2  AQEELKEESGF NM_000106_5_CYP         0.231      4113.65     2.50
    8    HLA-B4402  QEELKEESGFL    QELKEESGF      0      0      0      1      1   QEELKEESGF NM_000106_5_CYP         0.123     13202.71     6.00
    9    HLA-B4402  EELKEESGFLR    EELKEESGF      0      0      0      0      0    EELKEESGF NM_000106_5_CYP         0.076     21904.46    13.00
   10    HLA-B4402  ELKEESGFLRE    ELKEESGFL      0      0      0      0      0    ELKEESGFL NM_000106_5_CYP         0.030     36301.74    55.00
   11    HLA-B4402  LKEESGFLREX    KEESFLREX      1      0      0      4      1   KEESGFLREX NM_000106_5_CYP         0.060     26205.35    19.00
-----------------------------------------------------------------------------------

As can be seen, each file is basically a series of tables (with identical headers) with text in between them. I would like to keep only the tables and, if possible, get rid of the dashed lines, keeping only the data (and header) separated by \t on each line.

The optimal result would be like this:

pos          HLA      peptide         Core Offset  I_pos  I_len  D_pos  D_len        iCore        Identity 1-log50k(aff) Affinity(nM)    %Rank  BindLevel
    0    HLA-B4402  GSHDLGIILQK    GSHDLGIIL      0      0      0      0      0    GSHDLGIIL NM_000094_3_COL         0.015     42580.79    90.00
    1    HLA-B4402  SHDLGIILQKI    SLGIILQKI      0      0      0      1      2  SHDLGIILQKI NM_000094_3_COL         0.024     38731.55    65.00
    2    HLA-B4402  HDLGIILQKIR    HDLIILQKI      0      0      0      3      1   HDLGIILQKI NM_000094_3_COL         0.024     38400.24    65.00
    3    HLA-B4402  DLGIILQKIRD    DLGIILQKI      0      0      0      0      0    DLGIILQKI NM_000094_3_COL         0.011     44267.78    95.00
    4    HLA-B4402  LGIILQKIRDM    LGIILQRDM      0      0      0      6      2  LGIILQKIRDM NM_000094_3_COL         0.024     38411.46    65.00
    5    HLA-B4402  GIILQKIRDMP    GIILQIRDM      0      0      0      5      1   GIILQKIRDM NM_000094_3_COL         0.017     41463.75    80.00
    6    HLA-B4402  IILQKIRDMPY    IILQKIRDY      0      0      0      8      2  IILQKIRDMPY NM_000094_3_COL         0.025     38152.18    65.00
    7    HLA-B4402  ILQKIRDMPYM    ILQKIRMPY      0      0      0      6      1   ILQKIRDMPY NM_000094_3_COL         0.025     37993.98    60.00
    8    HLA-B4402  LQKIRDMPYMD    QKIRDMPYM      1      0      0      0      0    QKIRDMPYM NM_000094_3_COL         0.015     42595.54    90.00
    9    HLA-B4402  QKIRDMPYMDP    QKIRDMPYM      0      0      0      0      0    QKIRDMPYM NM_000094_3_COL         0.017     41645.82    85.00
   10    HLA-B4402  KIRDMPYMDPS    KDMPYMDPS      0      0      0      1      2  KIRDMPYMDPS NM_000094_3_COL         0.023     39039.53    70.00
   11    HLA-B4402  IRDMPYMDPSX    RDMPYMPSX      1      0      0      6      1   RDMPYMDPSX NM_000094_3_COL         0.036     33871.57    41.00
    0    HLA-B4402  PVTGYKVQYTS    TGYKVQYTS      2      0      0      0      0    TGYKVQYTS NM_000094_3_COL         0.011     44190.25    95.00
    1    HLA-B4402  VTGYKVQYTSL    VTGYQYTSL      0      0      0      4      2  VTGYKVQYTSL NM_000094_3_COL         0.020     40061.36    75.00
    2    HLA-B4402  TGYKVQYTSLT    TGYKVYTSL      0      0      0      5      1   TGYKVQYTSL NM_000094_3_COL         0.020     40487.08    75.00
    3    HLA-B4402  GYKVQYTSLTG    YVQYTSLTG      1      0      0      1      1   YKVQYTSLTG NM_000094_3_COL         0.017     41521.20    80.00
    4    HLA-B4402  YKVQYTSLTGL    YQYTSLTGL      0      0      0      1      2  YKVQYTSLTGL NM_000094_3_COL         0.031     35710.76    49.00
    5    HLA-B4402  KVQYTSLTGLG    KVQYTSLTL      0      0      0      8      1   KVQYTSLTGL NM_000094_3_COL         0.029     36392.20    55.00
    6    HLA-B4402  VQYTSLTGLGQ    VQYTSLTGL      0      0      0      0      0    VQYTSLTGL NM_000094_3_COL         0.016     42180.50    85.00
    7    HLA-B4402  QYTSLTGLGQP    QYTSLTGLG      0      0      0      0      0    QYTSLTGLG NM_000094_3_COL         0.011     44293.17    95.00
    8    HLA-B4402  YTSLTGLGQPL    YTSLLGQPL      0      0      0      4      2  YTSLTGLGQPL NM_000094_3_COL         0.034     34547.04    44.00
    9    HLA-B4402  TSLTGLGQPLP    SLTGLGQPL      1      0      0      0      0    SLTGLGQPL NM_000094_3_COL         0.024     38475.10    65.00
   10    HLA-B4402  SLTGLGQPLPS    SLTGLGQPL      0      0      0      0      0    SLTGLGQPL NM_000094_3_COL         0.026     37575.76    60.00
   11    HLA-B4402  LTGLGQPLPSX    LLGQPLPSX      0      0      0      1      2  LTGLGQPLPSX NM_000094_3_COL         0.014     42874.84    90.00
    0    HLA-B4402  FLRLLDLAQEE    RLLDLAQEE      2      0      0      0      0    RLLDLAQEE NM_000106_5_CYP         0.014     42841.45    90.00
    1    HLA-B4402  LRLLDLAQEEL    RLLDLAQEL      1      0      0      7      1   RLLDLAQEEL NM_000106_5_CYP         0.029     36648.25    55.00
    2    HLA-B4402  RLLDLAQEELK    RLLDLAQEL      0      0      0      7      1   RLLDLAQEEL NM_000106_5_CYP         0.029     36350.87    55.00
    3    HLA-B4402  LLDLAQEELKE    LLDLAQEEL      0      0      0      0      0    LLDLAQEEL NM_000106_5_CYP         0.013     43487.79    95.00
    4    HLA-B4402  LDLAQEELKEE    LDQEELKEE      0      0      0      2      2  LDLAQEELKEE NM_000106_5_CYP         0.008     45629.40    99.00
    5    HLA-B4402  DLAQEELKEES    AQEELKEES      2      0      0      0      0    AQEELKEES NM_000106_5_CYP         0.009     45287.57    99.00
    6    HLA-B4402  LAQEELKEESG    AEELKEESG      1      0      0      1      1   AQEELKEESG NM_000106_5_CYP         0.013     43568.32    95.00
    7    HLA-B4402  AQEELKEESGF    AELKEESGF      0      0      0      1      2  AQEELKEESGF NM_000106_5_CYP         0.231      4113.65     2.50
    8    HLA-B4402  QEELKEESGFL    QELKEESGF      0      0      0      1      1   QEELKEESGF NM_000106_5_CYP         0.123     13202.71     6.00
    9    HLA-B4402  EELKEESGFLR    EELKEESGF      0      0      0      0      0    EELKEESGF NM_000106_5_CYP         0.076     21904.46    13.00
   10    HLA-B4402  ELKEESGFLRE    ELKEESGFL      0      0      0      0      0    ELKEESGFL NM_000106_5_CYP         0.030     36301.74    55.00
   11    HLA-B4402  LKEESGFLREX    KEESFLREX      1      0      0      4      1   KEESGFLREX NM_000106_5_CYP         0.060     26205.35    19.00

So this is what I am struggling with:

1. How do I concatenate all tables within the same file into a single table?

2. Is it possible to concatenate all tables from all files into a single table?

If there is a way to do it in R, it is also fine.

Thanks a lot!

PS: I went through the Similar questions section but couldn't find a solution along these lines.

CodePudding user response:

It should be something like this, where file_names is a character vector holding the paths to your files:

df_list <- lapply(file_names, read.table, skip = 6)
df <- do.call('rbind', df_list)

Then add your column names at the end.
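
One caveat: with the sample file shown in the question, read.table(..., skip = 6) on its own is likely to stop with an error once it reaches the dashed lines and the "Protein ..." summary between tables, because those lines do not have the same number of fields as the data rows. A minimal base R sketch that filters the lines first is given below. The helper name read_tables and the folder name "predictions" are assumptions made for illustration, and the sketch assumes every file contains at least one table.

```r
# Hypothetical helper: pull every table out of one prediction file.
read_tables <- function(path) {
  lines <- readLines(path, warn = FALSE)

  # Column names come from the first header line (whitespace, then "pos").
  header_line <- grep("^\\s*pos\\s", lines, value = TRUE)[1]
  col_names <- strsplit(trimws(header_line), "\\s+")[[1]]

  # Data rows start with optional whitespace followed by the integer position,
  # so dashed lines, "#" comments and the "Protein ..." summary are skipped.
  rows <- grep("^\\s*[0-9]+\\s", lines, value = TRUE)

  # fill = TRUE pads the empty BindLevel field with NA;
  # check.names = FALSE keeps names such as "%Rank" unchanged.
  read.table(text = rows, col.names = col_names, fill = TRUE,
             check.names = FALSE, comment.char = "",
             stringsAsFactors = FALSE)
}

# Assumed layout: all files sit in one folder and end in .txt.
file_names <- list.files("predictions", pattern = "\\.txt$", full.names = TRUE)

# One data frame per file, stacked into a single table.
df <- do.call(rbind, lapply(file_names, read_tables))
```

Because the header is taken from the file itself, the column names do not need to be added afterwards. The sample shows no values in BindLevel; if your files do contain them, check that each value parses as a single whitespace-free field.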

CodePudding user response:

This will extract and parse the data from one file.

I've tried to split the data and add a header, but I'm not 100% sure it has worked properly:

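library(dplyr)
library(stringr)  # str_detect(), str_split()
library(tidyr)    # separate()

# Read the raw file, one text line per row
original_df <-
  data.frame(Column1 = readLines("ProteinData.txt", warn = FALSE))

# Header lines start with whitespace followed by "pos"
header <- original_df %>% filter(str_detect(Column1, "^\\s+pos"))

# Split the first header line on runs of whitespace
header <- unlist(str_split(header$Column1[1], "\\s+"))

# The leading whitespace produces an empty first name
header <- replace(header, header == "", "Unused")

# Keep the data rows (whitespace, then a digit), split them into columns
# and drop the empty first column; fill = "right" pads BindLevel with NA
parsed_df <- original_df %>%
  filter(str_detect(Column1, "^\\W+\\d")) %>%
  separate(Column1, header, sep = "\\s+", fill = "right") %>%
  select(!c(1))

Output:
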
pos HLA peptide Core Offset I_pos I_len D_pos D_len iCore Identity 1-log50k(aff) Affinity(nM) %Rank BindLevel
0 HLA-B4402 GSHDLGIILQK GSHDLGIIL 0 0 0 0 0 GSHDLGIIL NM_000094_3_COL 0.015 42580.79 90.00 NA
1 HLA-B4402 SHDLGIILQKI SLGIILQKI 0 0 0 1 2 SHDLGIILQKI NM_000094_3_COL 0.024 38731.55 65.00 NA
2 HLA-B4402 HDLGIILQKIR HDLIILQKI 0 0 0 3 1 HDLGIILQKI NM_000094_3_COL 0.024 38400.24 65.00 NA
3 HLA-B4402 DLGIILQKIRD DLGIILQKI 0 0 0 0 0 DLGIILQKI NM_000094_3_COL 0.011 44267.78 95.00 NA
4 HLA-B4402 LGIILQKIRDM LGIILQRDM 0 0 0 6 2 LGIILQKIRDM NM_000094_3_COL 0.024 38411.46 65.00 NA
5 HLA-B4402 GIILQKIRDMP GIILQIRDM 0 0 0 5 1 GIILQKIRDM NM_000094_3_COL 0.017 41463.75 80.00 NA
6 HLA-B4402 IILQKIRDMPY IILQKIRDY 0 0 0 8 2 IILQKIRDMPY NM_000094_3_COL 0.025 38152.18 65.00 NA
7 HLA-B4402 ILQKIRDMPYM ILQKIRMPY 0 0 0 6 1 ILQKIRDMPY NM_000094_3_COL 0.025 37993.98 60.00 NA
8 HLA-B4402 LQKIRDMPYMD QKIRDMPYM 1 0 0 0 0 QKIRDMPYM NM_000094_3_COL 0.015 42595.54 90.00 NA
9 HLA-B4402 QKIRDMPYMDP QKIRDMPYM 0 0 0 0 0 QKIRDMPYM NM_000094_3_COL 0.017 41645.82 85.00 NA
10 HLA-B4402 KIRDMPYMDPS KDMPYMDPS 0 0 0 1 2 KIRDMPYMDPS NM_000094_3_COL 0.023 39039.53 70.00 NA
11 HLA-B4402 IRDMPYMDPSX RDMPYMPSX 1 0 0 6 1 RDMPYMDPSX NM_000094_3_COL 0.036 33871.57 41.00 NA
0 HLA-B4402 PVTGYKVQYTS TGYKVQYTS 2 0 0 0 0 TGYKVQYTS NM_000094_3_COL 0.011 44190.25 95.00 NA
1 HLA-B4402 VTGYKVQYTSL VTGYQYTSL 0 0 0 4 2 VTGYKVQYTSL NM_000094_3_COL 0.020 40061.36 75.00 NA
2 HLA-B4402 TGYKVQYTSLT TGYKVYTSL 0 0 0 5 1 TGYKVQYTSL NM_000094_3_COL 0.020 40487.08 75.00 NA
3 HLA-B4402 GYKVQYTSLTG YVQYTSLTG 1 0 0 1 1 YKVQYTSLTG NM_000094_3_COL 0.017 41521.20 80.00 NA
4 HLA-B4402 YKVQYTSLTGL YQYTSLTGL 0 0 0 1 2 YKVQYTSLTGL NM_000094_3_COL 0.031 35710.76 49.00 NA
5 HLA-B4402 KVQYTSLTGLG KVQYTSLTL 0 0 0 8 1 KVQYTSLTGL NM_000094_3_COL 0.029 36392.20 55.00 NA
6 HLA-B4402 VQYTSLTGLGQ VQYTSLTGL 0 0 0 0 0 VQYTSLTGL NM_000094_3_COL 0.016 42180.50 85.00 NA
7 HLA-B4402 QYTSLTGLGQP QYTSLTGLG 0 0 0 0 0 QYTSLTGLG NM_000094_3_COL 0.011 44293.17 95.00 NA
8 HLA-B4402 YTSLTGLGQPL YTSLLGQPL 0 0 0 4 2 YTSLTGLGQPL NM_000094_3_COL 0.034 34547.04 44.00 NA
9 HLA-B4402 TSLTGLGQPLP SLTGLGQPL 1 0 0 0 0 SLTGLGQPL NM_000094_3_COL 0.024 38475.10 65.00 NA
10 HLA-B4402 SLTGLGQPLPS SLTGLGQPL 0 0 0 0 0 SLTGLGQPL NM_000094_3_COL 0.026 37575.76 60.00 NA
11 HLA-B4402 LTGLGQPLPSX LLGQPLPSX 0 0 0 1 2 LTGLGQPLPSX NM_000094_3_COL 0.014 42874.84 90.00 NA
0 HLA-B4402 FLRLLDLAQEE RLLDLAQEE 2 0 0 0 0 RLLDLAQEE NM_000106_5_CYP 0.014 42841.45 90.00 NA
1 HLA-B4402 LRLLDLAQEEL RLLDLAQEL 1 0 0 7 1 RLLDLAQEEL NM_000106_5_CYP 0.029 36648.25 55.00 NA
2 HLA-B4402 RLLDLAQEELK RLLDLAQEL 0 0 0 7 1 RLLDLAQEEL NM_000106_5_CYP 0.029 36350.87 55.00 NA
3 HLA-B4402 LLDLAQEELKE LLDLAQEEL 0 0 0 0 0 LLDLAQEEL NM_000106_5_CYP 0.013 43487.79 95.00 NA
4 HLA-B4402 LDLAQEELKEE LDQEELKEE 0 0 0 2 2 LDLAQEELKEE NM_000106_5_CYP 0.008 45629.40 99.00 NA
5 HLA-B4402 DLAQEELKEES AQEELKEES 2 0 0 0 0 AQEELKEES NM_000106_5_CYP 0.009 45287.57 99.00 NA
6 HLA-B4402 LAQEELKEESG AEELKEESG 1 0 0 1 1 AQEELKEESG NM_000106_5_CYP 0.013 43568.32 95.00 NA
7 HLA-B4402 AQEELKEESGF AELKEESGF 0 0 0 1 2 AQEELKEESGF NM_000106_5_CYP 0.231 4113.65 2.50 NA
8 HLA-B4402 QEELKEESGFL QELKEESGF 0 0 0 1 1 QEELKEESGF NM_000106_5_CYP 0.123 13202.71 6.00 NA
9 HLA-B4402 EELKEESGFLR EELKEESGF 0 0 0 0 0 EELKEESGF NM_000106_5_CYP 0.076 21904.46 13.00 NA
10 HLA-B4402 ELKEESGFLRE ELKEESGFL 0 0 0 0 0 ELKEESGFL NM_000106_5_CYP 0.030 36301.74 55.00 NA
11 HLA-B4402 LKEESGFLREX KEESFLREX 1 0 0 4 1 KEESGFLREX NM_000106_5_CYP 0.060 26205.35 19.00 NA
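
The question also asks for the fields to be separated by \t, which the code above does not cover. A minimal sketch, assuming the merged table is held in a data frame such as parsed_df, could use write.table() from base R. The file name merged_tables.tsv is an assumption, and the small data frame below is only a stand-in built from two rows and a few columns of the sample so that the example runs on its own.

```r
# Stand-in for parsed_df: two rows copied from the sample, a few columns only.
parsed_df <- data.frame(
  pos = c(0L, 1L),
  HLA = "HLA-B4402",
  peptide = c("GSHDLGIILQK", "SHDLGIILQKI"),
  `%Rank` = c(90, 65),
  BindLevel = NA,
  check.names = FALSE  # keep "%Rank" as written
)

# Assumed output file name; adjust as needed.
out_file <- "merged_tables.tsv"

# sep = "\t" gives tab-separated fields, quote = FALSE leaves strings bare,
# row.names = FALSE drops the row index and na = "" writes empty cells
# instead of the text "NA".
write.table(parsed_df, out_file, sep = "\t",
            quote = FALSE, row.names = FALSE, na = "")
```

The header is written once as the first line, followed by one tab-separated line per data row.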