Home > Blockchain >  R - Merging and aligning two CSVs using common values in multiple columns
R - Merging and aligning two CSVs using common values in multiple columns

Time:02-25

I currently have two .csv files that look like this:

File 1:

Attempt Result
Intervention 1 B
Intervention 2 H

and File 2:

Name Outcome 1 Outcome 2 Outcome 3
Sample 1 A B C
Sample 2 D E F
Sample 3 G H I

I would like to merge and align the two .csvs such that the result each row of File 1 is aligned by its "result" cell, against any of the three "outcome" columns in File 2, leaving blanks or "NA"s if there are no similarities.

Ideally, would look like this:

Attempt Result Name Outcome 1 Outcome 2 Outcome 3
Intervention 1 B Sample 1 A B C
Sample 2 D E F
Intervention 2 H Sample 3 G H I

I've looked and only found answers when merging two .csv files with one common column. Any help would be very appreciated.

CodePudding user response:

I will assume that " Result " in File 1 is unique, since more File 1 rows with same result value (i.e "B") will force us to consider new columns in the final data frame.

By this way,

Attempt <- c("Intervention 1","Intervention 2")
Result <- c("B","H")
df1 <- as.data.frame(cbind(Attempt,Result))

one <- c("Sample 1","A","B","C")
two <- c("Sample 2","D","E","F")
three <- c("Sample 3","G","H","I")
df2 <- as.data.frame(rbind(one,two,three))
row.names(df2) <- 1:3
colnames(df2) <- c("Name","Outcome 1","Outcome 2","Outcome 3")


vec_at <- rep(NA,nrow(df2));vec_res <- rep(NA,nrow(df2)); # Define NA vectors

for (j in 1:nrow(df2)){
  a <- which(is.element(df1$Result,df2[j,2:4])==TRUE) # Row names which satisfy same element in two dataframes?
  if (length(a>=1)){ # Don't forget that "a" may not be a valid index if no element satify the condition
    vec_at[j] <- df1$Attempt[a] #just create a vector with wanted information
    vec_res[j] <- df1$Result[a]
  }
}

desired_df <- as.data.frame(cbind(vec_at,vec_res,df2)) # define your wanted data frame

Output:

          vec_at vec_res     Name Outcome 1 Outcome 2 Outcome 3
1 Intervention 1       B Sample 1         A         B         C
2           <NA>    <NA> Sample 2         D         E         F
3 Intervention 2       H Sample 3         G         H         I

CodePudding user response:

I wonder if you could use fuzzyjoin for something like this.

Here, you can provide a custom function for matching between the two data.frames.

library(fuzzyjoin)

fuzzy_left_join(
  df2,
  df1,
  match_fun = NULL,
  multi_by = list(x = paste0("Outcome_", 1:3), y = "Result"),
  multi_match_fun = function(x, y) {
    y == x[, "Outcome_1"] | y == x[, "Outcome_2"] | y == x[, "Outcome_3"]
  }
)

Output

      Name Outcome_1 Outcome_2 Outcome_3        Attempt Result
1 Sample_1         A         B         C Intervention_1      B
2 Sample_2         D         E         F           <NA>   <NA>
3 Sample_3         G         H         I Intervention_2      H
  •  Tags:  
  • r csv
  • Related