Home > database >  How to join multiple dataframes in R but exclude some columns?
How to join multiple dataframes in R but exclude some columns?

Time:11-12

I have 12 dataframes in total with differing number of rows and columns. For example the first one looks like:

   id age   bmi  surgerytype1 surgerytype2 surgerytype3 telangiectasia_G1
1 RQ00001-2  62 29.00                    0                    1                    0                      
2 RQ00002-0  47 21.72                    1                    0                    0                      
3 RQ00004-9  56 28.04                    1                    0                    0                      
4 RQ00005-4  63 31.23                    1                    0                    0                      
5 RQ00006-5  68 21.94                    1                    0                    0                      
6 RQ00007-7  55 22.84                    1                    0                    0 
                     
  cvd rt_axilla **telangiectasia_24_logistic_resid**
1   1         0                        1.6703390
2   0         0                       -0.2387378
3   0         0                       -0.3638869
4   0         0                       -0.4635757
5   0         0                       -0.3235804
6   1         0                       -0.3815662

Most of the dataframes follows these formats but they may have an extra column or one less column. In addition, they may have slightly different number of rows. I want to merge all these data-frames together by 'id'. All of these data-frames have 'id' in column 1 which should be similar BUT NOT identical across all data-frames.

In addition the columns I am particularly interested in, is the id column and last columns e.g. telangiectasia_24_logistic_resid (from the above data-frame - also highlighted in bold). All the last columns from every one of the 12 dataframes I am interested in and want to retain.

I want my final dataframe to have the id in the first column and then the last columns of the other dataframes e.g.

id  last_colfrom_dataframe_1 last_colfrom_dataframe_2 last_colfrom_dataframe_3 etc...
RQ00001-2     <-- DATA/VALUES  -->
RQ00002-0 
RQ00004-9 
RQ00005-4 
RQ00006-5 
RQ00007-7

I don't mind doing a full merge but this will retain extra columns i am not intrested in (e.g. age bmi surgerytype1). I just want id and the last columns from all the dataframes.

I have tried this merge as well as full_join command in R but it does not work.

residual_tox <- merge(resid_ID_telang_g1, resid_ID_atrophyg1, resid_ID_atrophyg2, resid_ID_NipRetract_G1, NipRetract_G2_resid, resid_ID_oedema1, Odema_G2_resid_ID, Indur_G1_resid_ID, Indur_outside_resid_ID, Indur_G2_resid_ID, arm_lymoG1_ID, late_hyperpig_ID, by = "id")

Thank you in advance.

CodePudding user response:

You could approach this by making a list of dataframes, each including just the ID column and the column you want to keep, and then using this list make the merged dataframe using the Reduce function.

Suppose your dataframes are called df1, df2, df3 as follows. Each has an id column, a column I want to discard and the last column I want to keep. Note the IDs don't necessarily overlap.

df1 <- data.frame(id=c(1,2), x=c("a", "b"), a=c("a", "b"))
df2 <- data.frame(id=c(2,3), y=c("a", "b"), b=c("cc", "ss"))
df3 <- data.frame(id=c(1,3), z=c("a", "b"), c=c("22", "1"))

Now make a list including each dataframe, but only its first and last column.

dfl <-  lapply(list(df1, df2, df3), function(d) d[, c("id", tail(names(d),1))])

Then the Reduce function can take this list, and sequentially apply merge until they are all combined:

Reduce(function(x1, x2) merge(x1, x2, all=TRUE, by="id") ,  dfl)

Which gives the output I think you expect:

  id    a    b    c
1  1    a <NA>   22
2  2    b   cc <NA>
3  3 <NA>   ss    1
  • Related