Home > Back-end >  Identify match by row, and create column with the column name of the match
Identify match by row, and create column with the column name of the match

Time:11-05

I have been trying (unsuccessfully) to use the solutions from several threads, especially: Get column name based on row values matching in DF in R (more than 3 column names) Populate a new column if a value is found in any column But when I adapt them to my specific needs, I fail... Other posts deal with matching to a specific value, or a maximum value etc., but that doesn't work for me.

So I have a variable (a date) and I need to know where that date comes from, based on other date variables. I have a very large dataset, so I need to keep the rest of that in place, and avoid using references to column numbers (e.g., df[2:3]) because my data source may change in the future, so best to stick to variable names (e.g., "date1"). Best to explain with an example:

df<-data.frame(id=as.character(c(1,2,3,4,5)), 
               date1=c("01/10/2022",NA,"1/10/2022","2/10/2022",NA),
               date2=c("02/10/2022",NA,"3/10/2022","4/10/2022","02/10/2022"),
               date3=c("05/10/2022","3/10/2022", "5/10/2022","6/10/2022","04/10/2022"),
               keydate=c("01/10/2022","3/10/2022", "1/10/2022","4/10/2022","05/10/2022"))

df$date1 <- as.Date(df$date1, format="%d/%m/%y")
df$date2 <- as.Date(df$date2, format="%d/%m/%y")
df$date3 <- as.Date(df$date3, format="%d/%m/%y")
df$keydate <- as.Date(df$keydate, format="%d/%m/%y")

Then I need a new variable "datesource" defining where "keydate" comes from, in my previous example this would be:

datesource=c("date1","date3","date1","date2","none")
cbind(df,datesource)

Some of my failed attempts include:

df$datesource = ifelse(df$keydate %in% df[c(date1,date2,date3),], colnames(df), "none") #get all "none"
#OR
df %>% mutate(datesource = ifelse(df$keydate %in% df, colnames(df), "none")) #get all "none"
#OR
df$datesource <- apply(df, colnames(df[c(date1,date2,date3),]), 
                     function(x) ifelse(any(x = df$keydate), colnames(df), 'none')) 
#Error in apply(df, colnames(df[c(date1, date2, date3), ]), function(x) ifelse(any(x = df$keydate),  : 'X' must have named dimnames

Thank you for help!

CodePudding user response:

Here is one option. You can map out the correct position in a vector of column names:

library(tidyverse)

df |>
  mutate(datesource = pmap_chr(list(date1, date2, date3, keydate), 
                               ~c("date1", "date2", "date3")[
                                 c(..1, ..2, ..3) %in% ..4
                               ]))
#>   id      date1      date2      date3    keydate datesource
#> 1  1 2020-10-01 2020-10-02 2020-10-05 2020-10-01      date1
#> 2  2       <NA>       <NA> 2020-10-03 2020-10-03      date3
#> 3  3 2020-10-01 2020-10-03 2020-10-05 2020-10-01      date1
#> 4  4 2020-10-02 2020-10-04 2020-10-06 2020-10-04      date2
#> 5  5       <NA> 2020-10-02 2020-10-04 2020-10-02      date2

EDIT include an option for "none"

library(tidyverse)

df |>
  mutate(datesource = pmap_chr(list(date1, date2, date3, keydate), 
                               ~ifelse(any(c(..1, ..2, ..3) %in% ..4),
                                 c("date1", "date2", "date3")[c(..1, ..2, ..3) %in% ..4],
                                 "none")))
#>   id      date1      date2      date3    keydate datesource
#> 1  1 2020-10-01 2020-10-02 2020-10-05 2020-10-01      date1
#> 2  2       <NA>       <NA> 2020-10-03 2020-10-03      date3
#> 3  3 2020-10-01 2020-10-03 2020-10-05 2020-10-01      date1
#> 4  4 2020-10-02 2020-10-04 2020-10-06 2020-10-04      date2
#> 5  5       <NA> 2020-10-02 2020-10-04 2020-10-05       none

CodePudding user response:

[Edit: I changed your example data to include a test case with no match.]

I would use which() to identify matches and index into the column names. This might be easier if you restructure your data to long format. Here’s a tidyverse solution:

library(tidyr)
library(dplyr)

df %>%
  pivot_longer(
    date1:date3,
    names_to = "label",
    values_to = "date"
  ) %>%
  group_by(id) %>%
  mutate(datesource = ifelse(
    keydate %in% date,
    label[which(date == keydate)],
    "none"
  )) %>%
  ungroup() %>%
  pivot_wider(names_from = label, values_from = date)
# A tibble: 5 × 6
  id    keydate    datesource date1      date2      date3     
  <chr> <date>     <chr>      <date>     <date>     <date>    
1 1     2020-10-01 date1      2020-10-01 2020-10-02 2020-10-05
2 2     2020-10-03 date3      NA         NA         2020-10-03
3 3     2020-10-01 date1      2020-10-01 2020-10-03 2020-10-05
4 4     2020-10-04 date2      2020-10-02 2020-10-04 2020-10-06
5 5     2020-11-02 none       NA         2020-10-02 2020-10-04
  • Related