I have been trying (unsuccessfully) to use the solutions from several threads, especially: Get column name based on row values matching in DF in R (more than 3 column names) Populate a new column if a value is found in any column But when I adapt them to my specific needs, I fail... Other posts deal with matching to a specific value, or a maximum value etc., but that doesn't work for me.
So I have a variable (a date) and I need to know where that date comes from, based on other date variables. I have a very large dataset, so I need to keep the rest of that in place, and avoid using references to column numbers (e.g., df[2:3]) because my data source may change in the future, so best to stick to variable names (e.g., "date1"). Best to explain with an example:
df<-data.frame(id=as.character(c(1,2,3,4,5)),
date1=c("01/10/2022",NA,"1/10/2022","2/10/2022",NA),
date2=c("02/10/2022",NA,"3/10/2022","4/10/2022","02/10/2022"),
date3=c("05/10/2022","3/10/2022", "5/10/2022","6/10/2022","04/10/2022"),
keydate=c("01/10/2022","3/10/2022", "1/10/2022","4/10/2022","05/10/2022"))
df$date1 <- as.Date(df$date1, format="%d/%m/%y")
df$date2 <- as.Date(df$date2, format="%d/%m/%y")
df$date3 <- as.Date(df$date3, format="%d/%m/%y")
df$keydate <- as.Date(df$keydate, format="%d/%m/%y")
Then I need a new variable "datesource" defining where "keydate" comes from, in my previous example this would be:
datesource=c("date1","date3","date1","date2","none")
cbind(df,datesource)
Some of my failed attempts include:
df$datesource = ifelse(df$keydate %in% df[c(date1,date2,date3),], colnames(df), "none") #get all "none"
#OR
df %>% mutate(datesource = ifelse(df$keydate %in% df, colnames(df), "none")) #get all "none"
#OR
df$datesource <- apply(df, colnames(df[c(date1,date2,date3),]),
function(x) ifelse(any(x = df$keydate), colnames(df), 'none'))
#Error in apply(df, colnames(df[c(date1, date2, date3), ]), function(x) ifelse(any(x = df$keydate), : 'X' must have named dimnames
Thank you for help!
CodePudding user response:
Here is one option. You can map out the correct position in a vector of column names:
library(tidyverse)
df |>
mutate(datesource = pmap_chr(list(date1, date2, date3, keydate),
~c("date1", "date2", "date3")[
c(..1, ..2, ..3) %in% ..4
]))
#> id date1 date2 date3 keydate datesource
#> 1 1 2020-10-01 2020-10-02 2020-10-05 2020-10-01 date1
#> 2 2 <NA> <NA> 2020-10-03 2020-10-03 date3
#> 3 3 2020-10-01 2020-10-03 2020-10-05 2020-10-01 date1
#> 4 4 2020-10-02 2020-10-04 2020-10-06 2020-10-04 date2
#> 5 5 <NA> 2020-10-02 2020-10-04 2020-10-02 date2
EDIT include an option for "none"
library(tidyverse)
df |>
mutate(datesource = pmap_chr(list(date1, date2, date3, keydate),
~ifelse(any(c(..1, ..2, ..3) %in% ..4),
c("date1", "date2", "date3")[c(..1, ..2, ..3) %in% ..4],
"none")))
#> id date1 date2 date3 keydate datesource
#> 1 1 2020-10-01 2020-10-02 2020-10-05 2020-10-01 date1
#> 2 2 <NA> <NA> 2020-10-03 2020-10-03 date3
#> 3 3 2020-10-01 2020-10-03 2020-10-05 2020-10-01 date1
#> 4 4 2020-10-02 2020-10-04 2020-10-06 2020-10-04 date2
#> 5 5 <NA> 2020-10-02 2020-10-04 2020-10-05 none
CodePudding user response:
[Edit: I changed your example data to include a test case with no match.]
I would use which()
to identify matches and index into the column names. This might be easier if you restructure your data to long format. Here’s a tidyverse solution:
library(tidyr)
library(dplyr)
df %>%
pivot_longer(
date1:date3,
names_to = "label",
values_to = "date"
) %>%
group_by(id) %>%
mutate(datesource = ifelse(
keydate %in% date,
label[which(date == keydate)],
"none"
)) %>%
ungroup() %>%
pivot_wider(names_from = label, values_from = date)
# A tibble: 5 × 6
id keydate datesource date1 date2 date3
<chr> <date> <chr> <date> <date> <date>
1 1 2020-10-01 date1 2020-10-01 2020-10-02 2020-10-05
2 2 2020-10-03 date3 NA NA 2020-10-03
3 3 2020-10-01 date1 2020-10-01 2020-10-03 2020-10-05
4 4 2020-10-04 date2 2020-10-02 2020-10-04 2020-10-06
5 5 2020-11-02 none NA 2020-10-02 2020-10-04