Replacing NA with string if string found on other variable

Time:11-17

I have a data frame containing the traffic sources of a web page. The column source holds the detailed source, while global_sources is a handmade categorization that doesn't change. What I'm trying to do is: if global_sources is NA, detect which of the unique values of global_sources appears in source, and replace the NA with the string that matched.

What the DF looks like:

source <- c('facebook / cpc', 'googletest', 'adwords.google', 'organic', 'source / referral', 'google / source', 'facebook / test')
global_sources <- c('facebook', 'google', NA, 'organic', 'referral', NA, NA)

df <- data.frame(source, global_sources)
df
             source global_sources
1    facebook / cpc       facebook
2        googletest         google
3    adwords.google           <NA>
4           organic        organic
5 source / referral       referral
6   google / source           <NA>
7   facebook / test           <NA>

What I currently have:

library(stringr)  # provides str_detect()

df$global_sources <- ifelse(is.na(df$global_sources),
                            ifelse(str_detect(df$source, 'facebook'), 'facebook', NA), df$global_sources)

df

             source global_sources
1    facebook / cpc       facebook
2        googletest         google
3    adwords.google           <NA>
4           organic        organic
5 source / referral       referral
6   google / source           <NA>
7   facebook / test       facebook

Up to this point the code only detects and replaces NA values in global_sources if the string in source matches 'facebook'; otherwise it leaves them as NA. The problem is that I need to do this for all the unique categories in global_sources. I tried nesting another ifelse, but there are so many categories that it ends up being very inefficient and hard to read.

Expected outcome:

I've been trying to write a for loop, but I haven't been able to get anything working. The intended outcome is:

             source global_sources
1    facebook / cpc       facebook
2        googletest         google
3    adwords.google         google
4           organic        organic
5 source / referral       referral
6   google / source         google
7   facebook / test       facebook

Note that some values in source can match more than one category (for example, google / referral). In those cases referral has the lowest priority, so any other matching category should win.
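For what it's worth, the for loop described above could look roughly like the sketch below. It is base R only and rests on a few assumptions: the categories are matched as literal substrings, "referral" is simply moved to the end of the list so it is tried last, and the extra row 'google / referral' is hypothetical (added only to show the priority rule).

```r
# Data from the question, plus one hypothetical row that matches two categories.
df <- data.frame(
  source = c("facebook / cpc", "googletest", "adwords.google", "organic",
             "source / referral", "google / source", "facebook / test",
             "google / referral"),  # hypothetical extra row
  global_sources = c("facebook", "google", NA, "organic", "referral", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Known categories, reordered so that "referral" is tried last (assumed priority).
categories <- unique(na.omit(df$global_sources))
categories <- c(setdiff(categories, "referral"), "referral")

for (category in categories) {
  # Rows that are still NA and whose source contains this category literally.
  hit <- is.na(df$global_sources) & grepl(category, df$source, fixed = TRUE)
  df$global_sources[hit] <- category
}

df
```

Because each pass only touches rows that are still NA, a category earlier in the list always wins over a later one, which is how 'google / referral' ends up as google rather than referral.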

CodePudding user response:

This works for the scope of your example, but you'll have to tinker with the regex for more use cases. If there's some kind of hierarchy among the categories, this won't work, and we'd need more information:

library(tidyverse)  # dplyr for mutate()/if_else(), stringr for str_extract()

df %>% mutate(global_sources = if_else(is.na(global_sources),
                                       str_extract(source, 'facebook|google'), global_sources))

# A tibble: 7 × 2
# source            global_sources
# <chr>             <chr>         
# 1 facebook / cpc    facebook      
# 2 googletest        google        
# 3 adwords.google    google        
# 4 organic           organic       
# 5 source / referral referral      
# 6 google / source   google        
# 7 facebook / test   facebook 
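If hard-coding 'facebook|google' is the sticking point, one possible variation is to build the pattern from the categories already present in global_sources. This is only a sketch: it assumes the category names contain no regex metacharacters (otherwise they would need escaping), and the names pattern and needs_fill are made up for the example.

```r
library(stringr)

df <- data.frame(
  source = c("facebook / cpc", "googletest", "adwords.google", "organic",
             "source / referral", "google / source", "facebook / test"),
  global_sources = c("facebook", "google", NA, "organic", "referral", NA, NA),
  stringsAsFactors = FALSE
)

# Collapse the known categories into one alternation pattern.
g <- unique(na.omit(df$global_sources))
pattern <- str_c(g, collapse = "|")

# Fill only the rows that are currently NA with the first match found in source.
needs_fill <- is.na(df$global_sources)
df$global_sources[needs_fill] <- str_extract(df$source[needs_fill], pattern)

df
```

Keep in mind that str_extract() returns the leftmost match in each string, so when a source contains several categories the result depends on where they appear in the text, not on any priority among them.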

CodePudding user response:

You should be able to do this with a single application of str_detect() when combined with a matching function. This assumes that whenever global_sources is NA, exactly one of the known categories matches source. If that's not true, you'll probably need something more complex.

library(tidyverse)

df <- data.frame(source = c('facebook / cpc', 'googletest', 'adwords.google', 'organic', 'source / referral', 'google / source', 'facebook / test'), 
                 global_sources = c('facebook', 'google', NA, 'organic', 'referral', NA, NA))

# pull out unique options for global_sources
g <- unique(na.omit(df$global_sources))

df %>% 
  rowwise() %>% 
  mutate(global_sources = if_else(
    is.na(global_sources),
    g[str_detect(source, g)],
    global_sources
  )) %>% 
  ungroup()
#> # A tibble: 7 × 2
#>   source            global_sources
#>   <chr>             <chr>         
#> 1 facebook / cpc    facebook      
#> 2 googletest        google        
#> 3 adwords.google    google        
#> 4 organic           organic       
#> 5 source / referral referral      
#> 6 google / source   google        
#> 7 facebook / test   facebook

Created on 2022-11-17 with reprex v2.0.2
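In case the "exactly one match" assumption doesn't hold, here is a rough base R sketch of what "something more complex" might look like. The helper first_match() is made up for this example, the last two rows of the data are hypothetical (one matches two categories, one matches none), and taking the first category in the order of g is just one possible tie-break rule.

```r
df <- data.frame(
  source = c("facebook / cpc", "googletest", "adwords.google", "organic",
             "source / referral", "google / source", "facebook / test",
             "google / referral", "bing / cpc"),  # last two rows are hypothetical
  global_sources = c("facebook", "google", NA, "organic", "referral", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

g <- unique(na.omit(df$global_sources))

# Hypothetical helper: first category (in the order given) found literally in
# src, or NA when nothing matches.
first_match <- function(src, categories) {
  hits <- categories[vapply(categories, grepl, logical(1), x = src, fixed = TRUE)]
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Only rows that are currently NA are looked up and overwritten.
needs_fill <- is.na(df$global_sources)
df$global_sources[needs_fill] <- vapply(
  df$source[needs_fill], first_match, character(1),
  categories = g, USE.NAMES = FALSE
)

df
```

With this version a source that matches nothing simply stays NA instead of raising an error, and a source that matches several categories gets whichever comes first in g, so reordering g changes the priority.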
