Replacing some data from a column with another column data only where a certain condition is met in-CodePudding

I have an extensive data frame with 30 columns and 10000 rows. Today I want to focus in two columns: languages and languages2:

languages             languages2

Spanish                  NA
Spanish                  NA
Other (specify)        French
Other (specify)        German
Other (specify)        Russian
English                  NA
Other (specify)        Portuguese
English                  NA
(...)

This is what I need:

languages

Spanish
Spanish
French
German
Russian
English
Portuguese
English
(...)

I am looking for an answer using mutate function from dplyr

CodePudding user response：

For a bigger data you may want to control for additional scenarios, like preventing replacing the data with NA and laving the Other value. Also if there may be a merit for addressing scenarios where the first column may contain Other() or Other, lang. You may want to consider using a regular expression or pre-processing the first column.

library("tidyverse")
dta <- tribble(
    ~lang1, ~lang2,
    "Spanish", NA,
    "Other", "English",
    "Other", NA
)

mutate(dta, lang1 = case_when(
    grepl("^Other,*", lang1) & !is.na(lang2) ~ lang2,
    TRUE ~ lang1
))

CodePudding user response：

Using dplyr, we could replace Other (specify) with NA, then use coalesce:

library(dplyr)

df %>%
  mutate(languages = coalesce(na_if(languages, "Other (specify)"), languages2)) %>%
  select(languages)

Output

  languages
1    Spanish
2    Spanish
3     French
4     German
5    Russian
6    English
7 Portuguese
8    English

A tidyverse option is to use str_replace_all to replace Other (specify) with the value from languages2.

library(tidyverse)

df %>%
  mutate(languages = str_replace_all(languages,"Other \\(specify\\)", languages2)) %>% 
  select(languages)

Data

df <- structure(list(languages = c("Spanish", "Spanish", "Other (specify)", 
"Other (specify)", "Other (specify)", "English", "Other (specify)", 
"English"), languages2 = c(NA, NA, "French", "German", "Russian", 
NA, "Portuguese", NA)), class = "data.frame", row.names = c(NA, 
-8L))

Benchmark

However, if you have a lot of data and need something faster, then you might consider base R, which would be faster than dplyr or data.table.

bm <- microbenchmark::microbenchmark(Konrad = mutate(df, languages = case_when(
  grepl("^Other,*", languages) & !is.na(languages2) ~ languages2,
  TRUE ~ languages
)),
langtang = df %>%
  mutate(languages=if_else(languages=="Other (specify)", languages2, languages)),
valentina = df %>%
  mutate(languages=if_else(!is.na(languages2), languages2, languages)),
andrew_stringr = df %>%
  mutate(languages = str_replace_all(languages,"Other \\(specify\\)", languages2)),
andrew_coalesce = df %>%
  mutate(languages = coalesce(na_if(languages, "Other (specify)"), languages2, 'Other (specify)')),
andrew_baseR = {df1 <- df; df1[df1$languages == "Other (specify)", "languages"] <- df1[df1$languages == "Other (specify)", "languages2" ]},
andrew_baseR_with = {df2 <- df; df2$languages <- with( df2, ifelse( languages == "Other (specify)", languages2, languages ) )},
andrew_datatable = {dt <- as.data.table(df); dt[languages == "Other (specify)", languages := languages2 ]},
times = 1000)

CodePudding user response：

data %>% 
  mutate(languages=if_else(languages=="Other (specify)", languages2, languages)

CodePudding user response：

Another possibility (if you prefer the check to be over the 2nd column instead of the 1st):

library(dplyr)

df <- data.frame(languages=c("Spanish","Spanish","Other (specify)","Other (specify)","Other (specify)","English","Other (specify)","English"),languages2=c(NA,NA,"French","German","Russian",NA,"Portuguese",NA))

df %>%
 mutate(languages=if_else(!is.na(languages2), languages2, languages))

   languages languages2
1    Spanish       <NA>
2    Spanish       <NA>
3     French     French
4     German     German
5    Russian    Russian
6    English       <NA>
7 Portuguese Portuguese
8    English       <NA>