Home > other >  R function to combine two text files depending on variable name in column1?
R function to combine two text files depending on variable name in column1?

Time:12-01

I have a text file with 846 rows and 11 columns. The first column contains different endpoints e.g. armlymphodoedema which has 65 rows, AtrophyG1 with 65 rows, AtrophyG2 with 65 rows, hyperpigmentation with 65 rows, IndurationG1 with 65 rows etc... There is a total of 12 endpoints with 65 rows each. They have a colon and number at the end which runs from 1-23. Not all rows shown:

Endpoint:CHROM  POS ID  REF ALT A1  OBS_CT  BETA    SE  T_STAT  P
armlymohoedemaG1_resid_ADD:1    15507197    rs12145963  C   T   T   1597    0.0200969   0.0177223   1.13399 0.25697
armlymohoedemaG1_resid_ADD:1    20955183    rs10916837:20955183:A:C A   C   C   1597    -0.0171274  0.0358423   -0.477853   0.632821
armlymohoedemaG1_resid_ADD:1    59499623    rs12088220:59499623:C:G C   G   G   1597    0.0380941   0.0318187   1.19722 0.2314
armlymohoedemaG1_resid_ADD:1    157426570   rs12562052:157426570:G:A    G   A   A   1597    -0.0206825  0.0197705   -1.04613    0.29566
armlymohoedemaG1_resid_ADD:1    208935416   rs12565978:208935416:T:C    T   C   C   1597    0.0125627   0.0249297   0.503926    0.614384
armlymohoedemaG1_resid_ADD:1    210570783   rs926581:210570783:T:C  T   C   T   1597    0.0100879   0.0171234   0.589128    0.555859
armlymohoedemaG1_resid_ADD:2    67992075    rs730424:67992075:G:A   G   A   A   1597    0.0576296   0.0400781   1.43793 0.150651
...
AtrophyG1_resid_ADD:1   15507197    rs12145963  C   T   T   1459    0.0603735   0.0423087   1.42698 0.153803
AtrophyG1_resid_ADD:1   20955183    rs10916837:20955183:A:C A   C   C   1459    -0.120878   0.0858682   -1.40771    0.159431
AtrophyG1_resid_ADD:1   59499623    rs12088220:59499623:C:G C   G   G   1459    -0.00668454 0.0769204   -0.0869021  0.930761
AtrophyG1_resid_ADD:1   157426570   rs12562052:157426570:G:A    G   A   A   1459    -0.00656737 0.0474368   -0.138445   0.889908
AtrophyG1_resid_ADD:1   208935416   rs12565978:208935416:T:C    T   C   C   1459    -0.00449699 0.0589798   -0.0762463  0.939234
AtrophyG1_resid_ADD:1   210570783   rs926581:210570783:T:C  T   C   T   1459    -0.0401359  0.0409854   -0.979273   0.327609
AtrophyG1_resid_ADD:2   67992075    rs730424:67992075:G:A   G   A   A   1459    -0.139139   0.0958094   -1.45225    0.146649
...
AtrophyG2_resid_ADD:1   15507197    rs12145963  C   T   T   1577    0.0379506   0.0295897   1.28256 0.199837
AtrophyG2_resid_ADD:1   20955183    rs10916837:20955183:A:C A   C   C   1577    -0.0396727  0.0593204   -0.668786   0.503731
AtrophyG2_resid_ADD:1   59499623    rs12088220:59499623:C:G C   G   G   1577    0.0372104   0.0532158   0.699236    0.484509
AtrophyG2_resid_ADD:1   157426570   rs12562052:157426570:G:A    G   A   A   1577    -0.00643034 0.0329302   -0.195272   0.845206
AtrophyG2_resid_ADD:1   208935416   rs12565978:208935416:T:C    T   C   C   1577    -0.0642606  0.0413718   -1.55325    0.120567
AtrophyG2_resid_ADD:1   210570783   rs926581:210570783:T:C  T   C   T   1577    -0.00569627 0.028436    -0.200319   0.841257
AtrophyG2_resid_ADD:2   67992075    rs730424:67992075:G:A   G   A   A   1577    -0.0725321  0.0665037   -1.09065    0.275596

I have another text file with the exact same number of columns but different number of rows for each endpoint in column 1. For example, the endpoints follow the same order but are repeated 8 times. E.g. armlymphodoedema has 8 rows, AtrophyG1 with 8 rows, AtrophyG2 has 8 rows, hyperpigmentation with 8 rows, IndurationG1 with 8 rows etc...

Endpoint:CHROM  POS ID  REF ALT A1  OBS_CT  BETA    SE  T_STAT  P
armlymohoedemaG1_resid_ADD:6    31568469    rs2857595   G   A   A   1597    0.0312277   0.0236413   1.3209  0.186727
armlymohoedemaG1_resid_ADD:8    3059828 chr8_3059828_A_G    A   G   G   1597    -0.0239881  0.0234187   -1.02432    0.305842
armlymohoedemaG1_resid_ADD:9    23658011    chr9_23658011_G_T   T   G   T   1597    -0.0211319  0.0187522   -1.1269 0.259957
armlymohoedemaG1_resid_ADD:9    132436381   chr9_132436381_A_G  A   G   G   1597    -0.0113841  0.0172074   -0.661585   0.508334
armlymohoedemaG1_resid_ADD:11   92329626    chr11_92329626_C_T  T   C   C   1597    0.0130181   0.0279289   0.466117    0.641196
armlymohoedemaG1_resid_ADD:17   70243611    chr17_70243611_A_G  G   A   A   1597    0.00269102  0.026312    0.102273    0.918553
armlymohoedemaG1_resid_ADD:18   22120936    chr18_22120936_C_T  T   C   C   1597    -0.00589036 0.0233133   -0.25266    0.800564
armlymohoedemaG1_resid_ADD:19   41860296    rs1800469   A   G   A   1597    -0.0114224  0.0176217   -0.648198   0.516951
AtrophyG1_resid_ADD:6   31568469    rs2857595   G   A   A   1459    -0.0562461  0.0575011   -0.978175   0.328152
AtrophyG1_resid_ADD:8   3059828 chr8_3059828_A_G    A   G   G   1459    0.0902879   0.0561128   1.60904 0.107826

How would I merge the two files in R, but I do not want the second file to be attached at the bottom of the first file but rather the armlymphodema rows to be underneath the armlymphodema in file 1. The AtrophyG1 to be underneath AtrophyG1 in file1. Another layer of complexity is that I want the numbers to be in order e.g. The second file has numbers
6,8,9,9,11,17,18,19 at the end of armlymphodema in column1. I want this to go in file 1 but in the correct order.

#Endpoint:CHROM POS ID  REF ALT A1  OBS_CT  BETA    SE  T_STAT  P
armlymohoedemaG1_resid_ADD:1    15507197    rs12145963  C   T   T   1597    0.0200969   0.0177223   1.13399 0.25697
armlymohoedemaG1_resid_ADD:1    20955183    rs10916837:20955183:A:C A   C   C   1597    -0.0171274  0.0358423   -0.477853   0.632821
armlymohoedemaG1_resid_ADD:1    59499623    rs12088220:59499623:C:G C   G   G   1597    0.0380941   0.0318187   1.19722 0.2314
armlymohoedemaG1_resid_ADD:1    157426570   rs12562052:157426570:G:A    G   A   A   1597    -0.0206825  0.0197705   -1.04613    0.29566
armlymohoedemaG1_resid_ADD:1    208935416   rs12565978:208935416:T:C    T   C   C   1597    0.0125627   0.0249297   0.503926    0.614384
armlymohoedemaG1_resid_ADD:1    210570783   rs926581:210570783:T:C  T   C   T   1597    0.0100879   0.0171234   0.589128    0.555859
armlymohoedemaG1_resid_ADD:2    67992075    rs730424:67992075:G:A   G   A   A   1597    0.0576296   0.0400781   1.43793 0.150651
...
armlymohoedemaG1_resid_ADD:6    31568469    rs2857595   G   A   A   1597    0.0312277   0.0236413   1.3209  0.186727
armlymohoedemaG1_resid_ADD:8    3059828 chr8_3059828_A_G    A   G   G   1597    -0.0239881  0.0234187   -1.02432    0.305842
armlymohoedemaG1_resid_ADD:9    23658011    chr9_23658011_G_T   T   G   T   1597    -0.0211319  0.0187522   -1.1269 0.259957
armlymohoedemaG1_resid_ADD:9    132436381   chr9_132436381_A_G  A   G   G   1597    -0.0113841  0.0172074   -0.661585   0.508334
armlymohoedemaG1_resid_ADD:11   92329626    chr11_92329626_C_T  T   C   C   1597    0.0130181   0.0279289   0.466117    0.641196
armlymohoedemaG1_resid_ADD:17   70243611    chr17_70243611_A_G  G   A   A   1597    0.00269102  0.026312    0.102273    0.918553
armlymohoedemaG1_resid_ADD:18   22120936    chr18_22120936_C_T  T   C   C   1597    -0.00589036 0.0233133   -0.25266    0.800564
armlymohoedemaG1_resid_ADD:19   41860296    rs1800469   A   G   A   1597    -0.0114224  0.0176217   -0.648198   0.516951

I know of functions such as cbind, rbind, merge but I am at a loss as what to do when matching by variable name then combining so that it's intertwined with that variable name. I am also assuming I have to include some sort of partial match e.g. mydataframe[str_detect(Endpoint:CHROM, "armlymohoedemaG1_resid_ADD"),] to do a match.

CodePudding user response:

You could use filter technique to separate the files into chunks and then glue them together. Example: Let df1 be the first data frame and df2 the second one

# all lines starting with armlymohoedemaG1_resid_ADD from df1 and df2:
chunk11 <- df1[startsWith(df1$CHROM, "armlymohoedemaG1_resid_ADD"), ]
chunk12 <- df2[startsWith(df2$CHROM, "armlymohoedemaG1_resid_ADD"), ]

# now glue them together
chunk1 <- rbind(chunk11, chunk12)

# optionally order them by number to get in sequence :1, :2, :3 ...
chunk1 <- chunk1[order(chunk1$CHROM), ]

# repeat this for all CHROM levels
# at the end glue everything together
df <- rbind(chunk1, chunk2, ....)

If you do not need the chunks separated, you can glue and sort in two steps:

df <- rbind(df1, df2)
df <- df[order(df$CHROM), ]

Maybe that's good enough for your application

  • Related