Replacing NA with a string if the string is found in another variable

I have a data frame that contains the traffic sources of a webpage. The column source contains the detailed source, while global_sources is a handmade categorization that doesn't change. If global_sources is NA, I want to search source for each of the unique values of global_sources and replace the NA with the value that matched.

What the data frame looks like:

source <- c('facebook / cpc', 'googletest', 'adwords.google', 'organic', 'source / referral', 'google / source', 'facebook / test')
global_sources <- c('facebook', 'google', NA, 'organic', 'referral', NA, NA)

df <- data.frame(source, global_sources)
df

source	global_sources
facebook / cpc	facebook
googletest	google
adwords.google	NA
organic	organic
source / referral	referral
google / referral	NA
facebook / test	NA

What I currently have:

library(stringr)  # provides str_detect()
df$global_sources <- ifelse(is.na(df$global_sources),
                            ifelse(str_detect(df$source, 'facebook'), 'facebook', NA), df$global_sources)

df

source	global_sources
facebook / cpc	facebook
googletest	google
adwords.google	NA
organic	organic
source / referral	referral
google / referral	NA
facebook / test	facebook

Up to this point the code only detects and replaces NA values in global_sources if the string in source matches 'facebook'; otherwise it leaves them as NA. The problem is that I need to do this for all the unique categories in global_sources. I tried nesting another ifelse, but there are so many categories that it becomes very inefficient and hard to read.

Expected outcome:

I'm trying to do a for loop but I haven't been able to do anything that works. The intended outcome is:

source	global_sources
facebook / cpc	facebook
googletest	google
adwords.google	google
organic	organic
source / referral	referral
google / referral	google
facebook / test	facebook

Note that some values in source match more than one category (like google / referral); in those cases, referral has the lowest priority.

CodePudding user response：

This works for the scope of your example, but you'll have to tinker with the regex to cover more cases. If the categories follow some kind of hierarchy, this won't work as is, and we'd need more information:

library(dplyr)
library(stringr)
df %>% mutate(global_sources = if_else(is.na(global_sources), 
              str_extract(source, 'facebook|google') , global_sources))

# A tibble: 7 × 2
# source            global_sources
# <chr>             <chr>         
# 1 facebook / cpc    facebook      
# 2 googletest        google        
# 3 adwords.google    google        
# 4 organic           organic       
# 5 source / referral referral      
# 6 google / source   google        
# 7 facebook / test   facebook
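
If you'd rather not hard-code 'facebook|google', one option is to build the alternation from the values already present in global_sources. This is only a sketch: the names categories, pattern and result are mine, and it assumes the category names contain no regex special characters.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  source = c("facebook / cpc", "googletest", "adwords.google", "organic",
             "source / referral", "google / source", "facebook / test"),
  global_sources = c("facebook", "google", NA, "organic", "referral", NA, NA),
  stringsAsFactors = FALSE
)

# Collect the non-missing categories, in order of first appearance
categories <- unique(df$global_sources[!is.na(df$global_sources)])

# Join them into a single regex alternation
# (assumption: no category contains regex metacharacters)
pattern <- str_c(categories, collapse = "|")

# coalesce() keeps existing values and fills only the NA rows
# with the first match found in source
result <- df %>%
  mutate(global_sources = coalesce(global_sources, str_extract(source, pattern)))
```

Keep in mind that str_extract() returns the leftmost match in the string, so the order of the alternatives in the pattern does not express a priority between categories that match at different positions.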

CodePudding user response：

You should be able to do this with a single application of str_detect() when combined with a matching function. This assumes that whenever global_sources is NA, exactly one of its unique values matches source. If that's not true, you'll probably need something more complex.

library(tidyverse)

df <- data.frame(source = c('facebook / cpc', 'googletest', 'adwords.google', 'organic', 'source / referral', 'google / source', 'facebook / test'), 
                 global_sources = c('facebook', 'google', NA, 'organic', 'referral', NA, NA))

# pull out unique options for global_sources
g <- unique(na.omit(df$global_sources))

df %>% 
  rowwise() %>% 
  mutate(global_sources = if_else(
    is.na(global_sources),
    g[str_detect(source, g)],
    global_sources
  )) %>% 
  ungroup()
#> # A tibble: 7 × 2
#>   source            global_sources
#>   <chr>             <chr>         
#> 1 facebook / cpc    facebook      
#> 2 googletest        google        
#> 3 adwords.google    google        
#> 4 organic           organic       
#> 5 source / referral referral      
#> 6 google / source   google        
#> 7 facebook / test   facebook

Created on 2022-11-17 with reprex v2.0.2
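
If a source can match several categories (like 'google / referral') or none at all, a possible sketch is to keep the categories in an explicit priority order and take the first one that matches. The priority vector, the first_match() helper and the 'newsletter' row are hypothetical and only there for illustration; I'm assuming referral should come last, as the question describes, and that the category names contain no regex special characters.

```r
library(stringr)

df <- data.frame(
  source = c("facebook / cpc", "adwords.google", "google / referral",
             "facebook / test", "newsletter"),
  global_sources = c("facebook", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Assumed priority order: earlier entries win, referral is the last resort
priority <- c("facebook", "google", "organic", "referral")

# Hypothetical helper: return the highest-priority category found in src,
# or NA when nothing matches
first_match <- function(src, categories) {
  hits <- categories[str_detect(src, categories)]
  if (length(hits) == 0) NA_character_ else hits[1]
}

# Fill only the rows where global_sources is missing
needs_fill <- is.na(df$global_sources)
df$global_sources[needs_fill] <- vapply(
  df$source[needs_fill], first_match, character(1),
  categories = priority, USE.NAMES = FALSE
)
```

With this ordering, 'google / referral' is filled with google rather than referral, and a row that matches nothing stays NA instead of raising an error.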