Home > Net >  In R dataframe, consolidate 5 columns with similar names into a single dataframe column
In R dataframe, consolidate 5 columns with similar names into a single dataframe column

Time:03-15

Given the following dataframe:

zed = data.frame(Title = c('a', NA, NA, NA, NA, NA), TITLE = c(NA, 'b', NA, NA, NA, NA), "Title/Description" = c(NA, NA, 'c', 'd', NA, NA), title = c(NA, NA, NA, NA, 'e', 'f'))

We'd like to consolidate everything down into one Title column. We can assume that in each row, there will only be 1 value that is not NA, and we are hunting for that value to add into the Title column. We can consolidate this down via the following approach:

yah <- zed %>% 
  dplyr::mutate(Title = ifelse(!is.na(Title), Title,
                               ifelse(!is.na(`Title.Description`), `Title.Description`,
                                      ifelse(!is.na(TITLE), TITLE,
                                             ifelse(!is.na(title), title, ""))))) %>%
  dplyr::select(-Title.Description, -TITLE, -title)

however this approach is messy, and becomes even messier if the dataframe has 10 title columns. this approach also breaks if a column is not present in the data:

# there's no TITLE2 --> ERROR
yah <- zed %>% 
  dplyr::mutate(Title = ifelse(!is.na(Title), Title,
                               ifelse(!is.na(`Title.Description`), `Title.Description`,
                                      ifelse(!is.na(TITLE2), TITLE2,
                                             ifelse(!is.na(title), title, ""))))) %>%
  dplyr::select(-Title.Description, -TITLE2, -title)

any recommendations on how to (a) eliminate the nasty ifelse() nesting and (b) not throw an error if one of the columns is omitted.

CodePudding user response:

dplyr has the coalesce function that takes the first non-missing value:

yah <- zed %>%
  mutate(Title = coalesce(
      Title, Title.Description, TITLE2, title, ""
    )
  )

This will still break if one of the columns doesn't exist. I'd recommend identifying any missing columns and setting them to NA before running this.

CodePudding user response:

Another option is to select only the columns that contain the the word title (regardless of case), then use coalesce. I also give the new column a different name, so that I could remove all the columns that contain title, then rename it.

library(tidyverse)

yah <- zed %>%
  mutate(temp = coalesce(!!!select(., contains("title")))) %>% 
  select(-contains("title")) %>% 
  rename(Title = temp)

Output

yah

  Title
1     a
2     b
3     c
4     d
5     e
6     f

Another option is to use max with c_across:

zed %>%
  rowwise() %>%
  mutate(temp = max(c_across(contains("title")), na.rm = TRUE)) %>% 
  select(-contains("title")) %>% 
  rename(Title = temp)

Or another option is to use reduce with pmax

zed %>% 
  mutate(temp = reduce(select(., contains("title")), pmax, na.rm = TRUE)) %>% 
  select(-contains("title")) %>% 
  rename(Title = temp)
  • Related