Say I have more than 200 files, each structured as depicted below:
# Peptide length 11
# Rank Threshold for Strong binding peptides 0.500
# Rank Threshold for Weak binding peptides 2.000
-----------------------------------------------------------------------------------
pos HLA peptide Core Offset I_pos I_len D_pos D_len iCore Identity 1-log50k(aff) Affinity(nM) %Rank BindLevel
-----------------------------------------------------------------------------------
0 HLA-B4402 GSHDLGIILQK GSHDLGIIL 0 0 0 0 0 GSHDLGIIL NM_000094_3_COL 0.015 42580.79 90.00
1 HLA-B4402 SHDLGIILQKI SLGIILQKI 0 0 0 1 2 SHDLGIILQKI NM_000094_3_COL 0.024 38731.55 65.00
2 HLA-B4402 HDLGIILQKIR HDLIILQKI 0 0 0 3 1 HDLGIILQKI NM_000094_3_COL 0.024 38400.24 65.00
3 HLA-B4402 DLGIILQKIRD DLGIILQKI 0 0 0 0 0 DLGIILQKI NM_000094_3_COL 0.011 44267.78 95.00
4 HLA-B4402 LGIILQKIRDM LGIILQRDM 0 0 0 6 2 LGIILQKIRDM NM_000094_3_COL 0.024 38411.46 65.00
5 HLA-B4402 GIILQKIRDMP GIILQIRDM 0 0 0 5 1 GIILQKIRDM NM_000094_3_COL 0.017 41463.75 80.00
6 HLA-B4402 IILQKIRDMPY IILQKIRDY 0 0 0 8 2 IILQKIRDMPY NM_000094_3_COL 0.025 38152.18 65.00
7 HLA-B4402 ILQKIRDMPYM ILQKIRMPY 0 0 0 6 1 ILQKIRDMPY NM_000094_3_COL 0.025 37993.98 60.00
8 HLA-B4402 LQKIRDMPYMD QKIRDMPYM 1 0 0 0 0 QKIRDMPYM NM_000094_3_COL 0.015 42595.54 90.00
9 HLA-B4402 QKIRDMPYMDP QKIRDMPYM 0 0 0 0 0 QKIRDMPYM NM_000094_3_COL 0.017 41645.82 85.00
10 HLA-B4402 KIRDMPYMDPS KDMPYMDPS 0 0 0 1 2 KIRDMPYMDPS NM_000094_3_COL 0.023 39039.53 70.00
11 HLA-B4402 IRDMPYMDPSX RDMPYMPSX 1 0 0 6 1 RDMPYMDPSX NM_000094_3_COL 0.036 33871.57 41.00
-----------------------------------------------------------------------------------
Protein NM_000094_3_COL. Allele HLA-B4402. Number of high binders 0. Number of weak binders 0. Number of peptides 12
-----------------------------------------------------------------------------------
# Rank Threshold for Strong binding peptides 0.500
# Rank Threshold for Weak binding peptides 2.000
-----------------------------------------------------------------------------------
pos HLA peptide Core Offset I_pos I_len D_pos D_len iCore Identity 1-log50k(aff) Affinity(nM) %Rank BindLevel
-----------------------------------------------------------------------------------
0 HLA-B4402 PVTGYKVQYTS TGYKVQYTS 2 0 0 0 0 TGYKVQYTS NM_000094_3_COL 0.011 44190.25 95.00
1 HLA-B4402 VTGYKVQYTSL VTGYQYTSL 0 0 0 4 2 VTGYKVQYTSL NM_000094_3_COL 0.020 40061.36 75.00
2 HLA-B4402 TGYKVQYTSLT TGYKVYTSL 0 0 0 5 1 TGYKVQYTSL NM_000094_3_COL 0.020 40487.08 75.00
3 HLA-B4402 GYKVQYTSLTG YVQYTSLTG 1 0 0 1 1 YKVQYTSLTG NM_000094_3_COL 0.017 41521.20 80.00
4 HLA-B4402 YKVQYTSLTGL YQYTSLTGL 0 0 0 1 2 YKVQYTSLTGL NM_000094_3_COL 0.031 35710.76 49.00
5 HLA-B4402 KVQYTSLTGLG KVQYTSLTL 0 0 0 8 1 KVQYTSLTGL NM_000094_3_COL 0.029 36392.20 55.00
6 HLA-B4402 VQYTSLTGLGQ VQYTSLTGL 0 0 0 0 0 VQYTSLTGL NM_000094_3_COL 0.016 42180.50 85.00
7 HLA-B4402 QYTSLTGLGQP QYTSLTGLG 0 0 0 0 0 QYTSLTGLG NM_000094_3_COL 0.011 44293.17 95.00
8 HLA-B4402 YTSLTGLGQPL YTSLLGQPL 0 0 0 4 2 YTSLTGLGQPL NM_000094_3_COL 0.034 34547.04 44.00
9 HLA-B4402 TSLTGLGQPLP SLTGLGQPL 1 0 0 0 0 SLTGLGQPL NM_000094_3_COL 0.024 38475.10 65.00
10 HLA-B4402 SLTGLGQPLPS SLTGLGQPL 0 0 0 0 0 SLTGLGQPL NM_000094_3_COL 0.026 37575.76 60.00
11 HLA-B4402 LTGLGQPLPSX LLGQPLPSX 0 0 0 1 2 LTGLGQPLPSX NM_000094_3_COL 0.014 42874.84 90.00
-----------------------------------------------------------------------------------
Protein NM_000094_3_COL. Allele HLA-B4402. Number of high binders 0. Number of weak binders 0. Number of peptides 12
-----------------------------------------------------------------------------------
# Rank Threshold for Strong binding peptides 0.500
# Rank Threshold for Weak binding peptides 2.000
-----------------------------------------------------------------------------------
pos HLA peptide Core Offset I_pos I_len D_pos D_len iCore Identity 1-log50k(aff) Affinity(nM) %Rank BindLevel
-----------------------------------------------------------------------------------
0 HLA-B4402 FLRLLDLAQEE RLLDLAQEE 2 0 0 0 0 RLLDLAQEE NM_000106_5_CYP 0.014 42841.45 90.00
1 HLA-B4402 LRLLDLAQEEL RLLDLAQEL 1 0 0 7 1 RLLDLAQEEL NM_000106_5_CYP 0.029 36648.25 55.00
2 HLA-B4402 RLLDLAQEELK RLLDLAQEL 0 0 0 7 1 RLLDLAQEEL NM_000106_5_CYP 0.029 36350.87 55.00
3 HLA-B4402 LLDLAQEELKE LLDLAQEEL 0 0 0 0 0 LLDLAQEEL NM_000106_5_CYP 0.013 43487.79 95.00
4 HLA-B4402 LDLAQEELKEE LDQEELKEE 0 0 0 2 2 LDLAQEELKEE NM_000106_5_CYP 0.008 45629.40 99.00
5 HLA-B4402 DLAQEELKEES AQEELKEES 2 0 0 0 0 AQEELKEES NM_000106_5_CYP 0.009 45287.57 99.00
6 HLA-B4402 LAQEELKEESG AEELKEESG 1 0 0 1 1 AQEELKEESG NM_000106_5_CYP 0.013 43568.32 95.00
7 HLA-B4402 AQEELKEESGF AELKEESGF 0 0 0 1 2 AQEELKEESGF NM_000106_5_CYP 0.231 4113.65 2.50
8 HLA-B4402 QEELKEESGFL QELKEESGF 0 0 0 1 1 QEELKEESGF NM_000106_5_CYP 0.123 13202.71 6.00
9 HLA-B4402 EELKEESGFLR EELKEESGF 0 0 0 0 0 EELKEESGF NM_000106_5_CYP 0.076 21904.46 13.00
10 HLA-B4402 ELKEESGFLRE ELKEESGFL 0 0 0 0 0 ELKEESGFL NM_000106_5_CYP 0.030 36301.74 55.00
11 HLA-B4402 LKEESGFLREX KEESFLREX 1 0 0 4 1 KEESGFLREX NM_000106_5_CYP 0.060 26205.35 19.00
-----------------------------------------------------------------------------------
As can be seen, each file is basically a series of tables (with identical headers) with text in between them. I would like to keep only the tables and, if possible, get rid of the dashed lines, keeping only the header and the data, with the fields on each line separated by \t.
The optimal result would be like this:
pos HLA peptide Core Offset I_pos I_len D_pos D_len iCore Identity 1-log50k(aff) Affinity(nM) %Rank BindLevel
0 HLA-B4402 GSHDLGIILQK GSHDLGIIL 0 0 0 0 0 GSHDLGIIL NM_000094_3_COL 0.015 42580.79 90.00
1 HLA-B4402 SHDLGIILQKI SLGIILQKI 0 0 0 1 2 SHDLGIILQKI NM_000094_3_COL 0.024 38731.55 65.00
2 HLA-B4402 HDLGIILQKIR HDLIILQKI 0 0 0 3 1 HDLGIILQKI NM_000094_3_COL 0.024 38400.24 65.00
3 HLA-B4402 DLGIILQKIRD DLGIILQKI 0 0 0 0 0 DLGIILQKI NM_000094_3_COL 0.011 44267.78 95.00
4 HLA-B4402 LGIILQKIRDM LGIILQRDM 0 0 0 6 2 LGIILQKIRDM NM_000094_3_COL 0.024 38411.46 65.00
5 HLA-B4402 GIILQKIRDMP GIILQIRDM 0 0 0 5 1 GIILQKIRDM NM_000094_3_COL 0.017 41463.75 80.00
6 HLA-B4402 IILQKIRDMPY IILQKIRDY 0 0 0 8 2 IILQKIRDMPY NM_000094_3_COL 0.025 38152.18 65.00
7 HLA-B4402 ILQKIRDMPYM ILQKIRMPY 0 0 0 6 1 ILQKIRDMPY NM_000094_3_COL 0.025 37993.98 60.00
8 HLA-B4402 LQKIRDMPYMD QKIRDMPYM 1 0 0 0 0 QKIRDMPYM NM_000094_3_COL 0.015 42595.54 90.00
9 HLA-B4402 QKIRDMPYMDP QKIRDMPYM 0 0 0 0 0 QKIRDMPYM NM_000094_3_COL 0.017 41645.82 85.00
10 HLA-B4402 KIRDMPYMDPS KDMPYMDPS 0 0 0 1 2 KIRDMPYMDPS NM_000094_3_COL 0.023 39039.53 70.00
11 HLA-B4402 IRDMPYMDPSX RDMPYMPSX 1 0 0 6 1 RDMPYMDPSX NM_000094_3_COL 0.036 33871.57 41.00
0 HLA-B4402 PVTGYKVQYTS TGYKVQYTS 2 0 0 0 0 TGYKVQYTS NM_000094_3_COL 0.011 44190.25 95.00
1 HLA-B4402 VTGYKVQYTSL VTGYQYTSL 0 0 0 4 2 VTGYKVQYTSL NM_000094_3_COL 0.020 40061.36 75.00
2 HLA-B4402 TGYKVQYTSLT TGYKVYTSL 0 0 0 5 1 TGYKVQYTSL NM_000094_3_COL 0.020 40487.08 75.00
3 HLA-B4402 GYKVQYTSLTG YVQYTSLTG 1 0 0 1 1 YKVQYTSLTG NM_000094_3_COL 0.017 41521.20 80.00
4 HLA-B4402 YKVQYTSLTGL YQYTSLTGL 0 0 0 1 2 YKVQYTSLTGL NM_000094_3_COL 0.031 35710.76 49.00
5 HLA-B4402 KVQYTSLTGLG KVQYTSLTL 0 0 0 8 1 KVQYTSLTGL NM_000094_3_COL 0.029 36392.20 55.00
6 HLA-B4402 VQYTSLTGLGQ VQYTSLTGL 0 0 0 0 0 VQYTSLTGL NM_000094_3_COL 0.016 42180.50 85.00
7 HLA-B4402 QYTSLTGLGQP QYTSLTGLG 0 0 0 0 0 QYTSLTGLG NM_000094_3_COL 0.011 44293.17 95.00
8 HLA-B4402 YTSLTGLGQPL YTSLLGQPL 0 0 0 4 2 YTSLTGLGQPL NM_000094_3_COL 0.034 34547.04 44.00
9 HLA-B4402 TSLTGLGQPLP SLTGLGQPL 1 0 0 0 0 SLTGLGQPL NM_000094_3_COL 0.024 38475.10 65.00
10 HLA-B4402 SLTGLGQPLPS SLTGLGQPL 0 0 0 0 0 SLTGLGQPL NM_000094_3_COL 0.026 37575.76 60.00
11 HLA-B4402 LTGLGQPLPSX LLGQPLPSX 0 0 0 1 2 LTGLGQPLPSX NM_000094_3_COL 0.014 42874.84 90.00
0 HLA-B4402 FLRLLDLAQEE RLLDLAQEE 2 0 0 0 0 RLLDLAQEE NM_000106_5_CYP 0.014 42841.45 90.00
1 HLA-B4402 LRLLDLAQEEL RLLDLAQEL 1 0 0 7 1 RLLDLAQEEL NM_000106_5_CYP 0.029 36648.25 55.00
2 HLA-B4402 RLLDLAQEELK RLLDLAQEL 0 0 0 7 1 RLLDLAQEEL NM_000106_5_CYP 0.029 36350.87 55.00
3 HLA-B4402 LLDLAQEELKE LLDLAQEEL 0 0 0 0 0 LLDLAQEEL NM_000106_5_CYP 0.013 43487.79 95.00
4 HLA-B4402 LDLAQEELKEE LDQEELKEE 0 0 0 2 2 LDLAQEELKEE NM_000106_5_CYP 0.008 45629.40 99.00
5 HLA-B4402 DLAQEELKEES AQEELKEES 2 0 0 0 0 AQEELKEES NM_000106_5_CYP 0.009 45287.57 99.00
6 HLA-B4402 LAQEELKEESG AEELKEESG 1 0 0 1 1 AQEELKEESG NM_000106_5_CYP 0.013 43568.32 95.00
7 HLA-B4402 AQEELKEESGF AELKEESGF 0 0 0 1 2 AQEELKEESGF NM_000106_5_CYP 0.231 4113.65 2.50
8 HLA-B4402 QEELKEESGFL QELKEESGF 0 0 0 1 1 QEELKEESGF NM_000106_5_CYP 0.123 13202.71 6.00
9 HLA-B4402 EELKEESGFLR EELKEESGF 0 0 0 0 0 EELKEESGF NM_000106_5_CYP 0.076 21904.46 13.00
10 HLA-B4402 ELKEESGFLRE ELKEESGFL 0 0 0 0 0 ELKEESGFL NM_000106_5_CYP 0.030 36301.74 55.00
11 HLA-B4402 LKEESGFLREX KEESFLREX 1 0 0 4 1 KEESGFLREX NM_000106_5_CYP 0.060 26205.35 19.00
So that's what I am struggling with:
1. How can I concatenate all tables within the same file into a single table?
2. Is it possible to concatenate all tables from all files into a single table?
If there is a way to do it in R, it is also fine.
Thanks a lot!
PS: I went through the Similar questions section but couldn't find a solution along these lines.
CodePudding user response:
It should be something like:
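file_names <- list.files(pattern = "\\.txt$")  # assumption: the files are .txt files in the working directory
read_rows <- function(f) read.table(text = grep("^\\s*[0-9]", readLines(f), value = TRUE))  # data rows only
df_list <- lapply(file_names, read_rows)
df <- do.call(rbind, df_list)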
Then add your column names at the end.
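A somewhat fuller base-R sketch of the same idea, for files that contain several tables with text in between: take the column names from the first header line, keep every line that starts with a digit, then stack the per-file results. The helper names read_tables and read_all_tables are made up for illustration. The code assumes that data rows always start with a digit (after optional whitespace) and that any fields beyond the first 14 belong to BindLevel.

```r
# Hypothetical helper: collect every table in one file into a single data frame.
read_tables <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  # The header line starts with "pos"; data rows start with a digit.
  header <- strsplit(grep("^pos\\s", lines, value = TRUE)[1], "\\s+")[[1]]
  rows <- strsplit(grep("^[0-9]", lines, value = TRUE), "\\s+")
  n <- length(header)
  # Keep the first n - 1 fields and fold anything left over into the last
  # column (BindLevel), which is empty in the sample shown in the question.
  fields <- lapply(rows, function(p) {
    c(p[seq_len(n - 1)], paste(p[-seq_len(n - 1)], collapse = " "))
  })
  df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(df) <- header
  # Turn numeric-looking columns into numbers; empty strings become NA.
  utils::type.convert(df, as.is = TRUE, na.strings = c("NA", ""))
}

# Hypothetical helper: stack the tables from many files, tagging each row
# with the name of the file it came from.
read_all_tables <- function(paths) {
  per_file <- lapply(paths, function(p) cbind(file = basename(p), read_tables(p)))
  do.call(rbind, per_file)
}
```

With file_names holding the paths of the files, read_all_tables(file_names) should return one data frame for everything, and read_tables(file_names[1]) the combined tables of a single file. A file without any data rows is not handled here.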
CodePudding user response:
This will extract and parse the data from one file.
I've tried to split the data and add a header, but I'm not 100% sure it worked properly:
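library(dplyr)
library(stringr)  # str_detect(), str_split()
library(tidyr)    # separate()
# Read the file as a one-column data frame of raw lines
original_df <-
  as.data.frame(readLines("ProteinData.txt", warn = FALSE))
colnames(original_df) <- c("Column1")
# Header line: leading whitespace, then "pos"
header <- original_df %>% filter(str_detect(Column1, "^\\s+pos"))
header <- unlist(str_split(header$Column1[1], "\\s+"))
# The leading whitespace yields an empty first field; name it so it can be dropped
header <- replace(header, header == "", "Unused")
# Data rows: leading whitespace, then a digit
parsed_df <- original_df %>%
  filter(str_detect(Column1, "^\\W+\\d")) %>%
  separate(Column1, header, sep = "\\s+") %>%
  select(!c(1))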
pos | HLA | peptide | Core | Offset | I_pos | I_len | D_pos | D_len | iCore | Identity | 1-log50k(aff) | Affinity(nM) | %Rank | BindLevel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HLA-B4402 | GSHDLGIILQK | GSHDLGIIL | 0 | 0 | 0 | 0 | 0 | GSHDLGIIL | NM_000094_3_COL | 0.015 | 42580.79 | 90.00 | NA |
1 | HLA-B4402 | SHDLGIILQKI | SLGIILQKI | 0 | 0 | 0 | 1 | 2 | SHDLGIILQKI | NM_000094_3_COL | 0.024 | 38731.55 | 65.00 | NA |
2 | HLA-B4402 | HDLGIILQKIR | HDLIILQKI | 0 | 0 | 0 | 3 | 1 | HDLGIILQKI | NM_000094_3_COL | 0.024 | 38400.24 | 65.00 | NA |
3 | HLA-B4402 | DLGIILQKIRD | DLGIILQKI | 0 | 0 | 0 | 0 | 0 | DLGIILQKI | NM_000094_3_COL | 0.011 | 44267.78 | 95.00 | NA |
4 | HLA-B4402 | LGIILQKIRDM | LGIILQRDM | 0 | 0 | 0 | 6 | 2 | LGIILQKIRDM | NM_000094_3_COL | 0.024 | 38411.46 | 65.00 | NA |
5 | HLA-B4402 | GIILQKIRDMP | GIILQIRDM | 0 | 0 | 0 | 5 | 1 | GIILQKIRDM | NM_000094_3_COL | 0.017 | 41463.75 | 80.00 | NA |
6 | HLA-B4402 | IILQKIRDMPY | IILQKIRDY | 0 | 0 | 0 | 8 | 2 | IILQKIRDMPY | NM_000094_3_COL | 0.025 | 38152.18 | 65.00 | NA |
7 | HLA-B4402 | ILQKIRDMPYM | ILQKIRMPY | 0 | 0 | 0 | 6 | 1 | ILQKIRDMPY | NM_000094_3_COL | 0.025 | 37993.98 | 60.00 | NA |
8 | HLA-B4402 | LQKIRDMPYMD | QKIRDMPYM | 1 | 0 | 0 | 0 | 0 | QKIRDMPYM | NM_000094_3_COL | 0.015 | 42595.54 | 90.00 | NA |
9 | HLA-B4402 | QKIRDMPYMDP | QKIRDMPYM | 0 | 0 | 0 | 0 | 0 | QKIRDMPYM | NM_000094_3_COL | 0.017 | 41645.82 | 85.00 | NA |
10 | HLA-B4402 | KIRDMPYMDPS | KDMPYMDPS | 0 | 0 | 0 | 1 | 2 | KIRDMPYMDPS | NM_000094_3_COL | 0.023 | 39039.53 | 70.00 | NA |
11 | HLA-B4402 | IRDMPYMDPSX | RDMPYMPSX | 1 | 0 | 0 | 6 | 1 | RDMPYMDPSX | NM_000094_3_COL | 0.036 | 33871.57 | 41.00 | NA |
0 | HLA-B4402 | PVTGYKVQYTS | TGYKVQYTS | 2 | 0 | 0 | 0 | 0 | TGYKVQYTS | NM_000094_3_COL | 0.011 | 44190.25 | 95.00 | NA |
1 | HLA-B4402 | VTGYKVQYTSL | VTGYQYTSL | 0 | 0 | 0 | 4 | 2 | VTGYKVQYTSL | NM_000094_3_COL | 0.020 | 40061.36 | 75.00 | NA |
2 | HLA-B4402 | TGYKVQYTSLT | TGYKVYTSL | 0 | 0 | 0 | 5 | 1 | TGYKVQYTSL | NM_000094_3_COL | 0.020 | 40487.08 | 75.00 | NA |
3 | HLA-B4402 | GYKVQYTSLTG | YVQYTSLTG | 1 | 0 | 0 | 1 | 1 | YKVQYTSLTG | NM_000094_3_COL | 0.017 | 41521.20 | 80.00 | NA |
4 | HLA-B4402 | YKVQYTSLTGL | YQYTSLTGL | 0 | 0 | 0 | 1 | 2 | YKVQYTSLTGL | NM_000094_3_COL | 0.031 | 35710.76 | 49.00 | NA |
5 | HLA-B4402 | KVQYTSLTGLG | KVQYTSLTL | 0 | 0 | 0 | 8 | 1 | KVQYTSLTGL | NM_000094_3_COL | 0.029 | 36392.20 | 55.00 | NA |
6 | HLA-B4402 | VQYTSLTGLGQ | VQYTSLTGL | 0 | 0 | 0 | 0 | 0 | VQYTSLTGL | NM_000094_3_COL | 0.016 | 42180.50 | 85.00 | NA |
7 | HLA-B4402 | QYTSLTGLGQP | QYTSLTGLG | 0 | 0 | 0 | 0 | 0 | QYTSLTGLG | NM_000094_3_COL | 0.011 | 44293.17 | 95.00 | NA |
8 | HLA-B4402 | YTSLTGLGQPL | YTSLLGQPL | 0 | 0 | 0 | 4 | 2 | YTSLTGLGQPL | NM_000094_3_COL | 0.034 | 34547.04 | 44.00 | NA |
9 | HLA-B4402 | TSLTGLGQPLP | SLTGLGQPL | 1 | 0 | 0 | 0 | 0 | SLTGLGQPL | NM_000094_3_COL | 0.024 | 38475.10 | 65.00 | NA |
10 | HLA-B4402 | SLTGLGQPLPS | SLTGLGQPL | 0 | 0 | 0 | 0 | 0 | SLTGLGQPL | NM_000094_3_COL | 0.026 | 37575.76 | 60.00 | NA |
11 | HLA-B4402 | LTGLGQPLPSX | LLGQPLPSX | 0 | 0 | 0 | 1 | 2 | LTGLGQPLPSX | NM_000094_3_COL | 0.014 | 42874.84 | 90.00 | NA |
0 | HLA-B4402 | FLRLLDLAQEE | RLLDLAQEE | 2 | 0 | 0 | 0 | 0 | RLLDLAQEE | NM_000106_5_CYP | 0.014 | 42841.45 | 90.00 | NA |
1 | HLA-B4402 | LRLLDLAQEEL | RLLDLAQEL | 1 | 0 | 0 | 7 | 1 | RLLDLAQEEL | NM_000106_5_CYP | 0.029 | 36648.25 | 55.00 | NA |
2 | HLA-B4402 | RLLDLAQEELK | RLLDLAQEL | 0 | 0 | 0 | 7 | 1 | RLLDLAQEEL | NM_000106_5_CYP | 0.029 | 36350.87 | 55.00 | NA |
3 | HLA-B4402 | LLDLAQEELKE | LLDLAQEEL | 0 | 0 | 0 | 0 | 0 | LLDLAQEEL | NM_000106_5_CYP | 0.013 | 43487.79 | 95.00 | NA |
4 | HLA-B4402 | LDLAQEELKEE | LDQEELKEE | 0 | 0 | 0 | 2 | 2 | LDLAQEELKEE | NM_000106_5_CYP | 0.008 | 45629.40 | 99.00 | NA |
5 | HLA-B4402 | DLAQEELKEES | AQEELKEES | 2 | 0 | 0 | 0 | 0 | AQEELKEES | NM_000106_5_CYP | 0.009 | 45287.57 | 99.00 | NA |
6 | HLA-B4402 | LAQEELKEESG | AEELKEESG | 1 | 0 | 0 | 1 | 1 | AQEELKEESG | NM_000106_5_CYP | 0.013 | 43568.32 | 95.00 | NA |
7 | HLA-B4402 | AQEELKEESGF | AELKEESGF | 0 | 0 | 0 | 1 | 2 | AQEELKEESGF | NM_000106_5_CYP | 0.231 | 4113.65 | 2.50 | NA |
8 | HLA-B4402 | QEELKEESGFL | QELKEESGF | 0 | 0 | 0 | 1 | 1 | QEELKEESGF | NM_000106_5_CYP | 0.123 | 13202.71 | 6.00 | NA |
9 | HLA-B4402 | EELKEESGFLR | EELKEESGF | 0 | 0 | 0 | 0 | 0 | EELKEESGF | NM_000106_5_CYP | 0.076 | 21904.46 | 13.00 | NA |
10 | HLA-B4402 | ELKEESGFLRE | ELKEESGFL | 0 | 0 | 0 | 0 | 0 | ELKEESGFL | NM_000106_5_CYP | 0.030 | 36301.74 | 55.00 | NA |
11 | HLA-B4402 | LKEESGFLREX | KEESFLREX | 1 | 0 | 0 | 4 | 1 | KEESGFLREX | NM_000106_5_CYP | 0.060 | 26205.35 | 19.00 | NA |
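If the goal is only the tab-separated text file described in the question, the lines can also be filtered directly without building a data frame. The sketch below uses base R; table_lines and write_combined are hypothetical names, and it assumes the BindLevel column is empty as in the sample (a value containing spaces would be split into extra tab-separated fields).

```r
# Hypothetical helper: reduce the raw lines of one file to tab-separated rows.
table_lines <- function(path, keep_header = TRUE) {
  lines <- trimws(readLines(path, warn = FALSE))
  is_header <- grepl("^pos\\s", lines)  # header lines start with "pos"
  is_data <- grepl("^[0-9]", lines)     # data rows start with a digit
  header <- if (keep_header) head(lines[is_header], 1) else character(0)
  # Collapse each run of whitespace into a single tab.
  gsub("\\s+", "\t", c(header, lines[is_data]))
}

# Hypothetical helper: write one combined file, keeping the header only once.
write_combined <- function(paths, out_path) {
  out <- unlist(lapply(seq_along(paths), function(i) {
    table_lines(paths[i], keep_header = (i == 1))
  }))
  writeLines(out, out_path)
  invisible(out)
}
```

Calling write_combined(file_names, "all_tables.tsv") would then produce a single file with one header line followed by the data rows of every table in every file; the output file name is just an example.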