I have a DF that contains the traffic sources of a webpage. The column source contains the detailed source, while global_sources is a handmade categorization that doesn't change. What I'm trying to do is: if global_sources is NA, string-detect the unique values of global_sources in source and replace the NA with the string that matched.
What the DF looks like:
source <- c('facebook / cpc', 'googletest', 'adwords.google', 'organic', 'source / referral', 'google / source', 'facebook / test')
global_sources <- c('facebook', 'google', NA, 'organic', 'referral', NA, NA)
df <- data.frame(source, global_sources)
df
source | global_sources |
---|---|
facebook / cpc | facebook |
googletest | google |
adwords.google | NA |
organic | organic |
source / referral | referral |
google / referral | NA |
facebook / test | NA |
What I currently have:
library(stringr)  # provides str_detect()
df$global_sources <- ifelse(is.na(df$global_sources),
                            ifelse(str_detect(df$source, 'facebook'), 'facebook', NA),
                            df$global_sources)
df
source | global_sources |
---|---|
facebook / cpc | facebook |
googletest | google |
adwords.google | NA |
organic | organic |
source / referral | referral |
google / referral | NA |
facebook / test | facebook |
Up to this point the code only detects and replaces NA values in global_sources if the string in source matches 'facebook'; otherwise it leaves the value as NA. The problem is that I need to do this for all the unique categories in global_sources. I tried nesting another ifelse, but there are so many categories that the result is inefficient and hard to read.
Expected outcome:
I'm trying to do this with a for loop, but I haven't been able to get anything to work. The intended outcome is:
source | global_sources |
---|---|
facebook / cpc | facebook |
googletest | google |
adwords.google | google |
organic | organic |
source / referral | referral |
google / referral | google |
facebook / test | facebook |
Note that some values in source match more than one category (like google / referral); in those cases referral has the lowest priority, so the other category should win.
CodePudding user response:
This works for the scope of your example, but you'll have to tinker with the regex for more use cases. If there's some kind of hierarchy among the categories, this won't work, and we'd need more information:
library(tidyverse)  # dplyr for mutate()/if_else(), stringr for str_extract()
df %>% mutate(global_sources = if_else(is.na(global_sources),
                                       str_extract(source, 'facebook|google'),
                                       global_sources))
# A tibble: 7 × 2
# source global_sources
# <chr> <chr>
# 1 facebook / cpc facebook
# 2 googletest google
# 3 adwords.google google
# 4 organic organic
# 5 source / referral referral
# 6 google / source google
# 7 facebook / test facebook
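If hard-coding 'facebook|google' is a concern, the alternation can be built from the categories already present in the column. The following is only a sketch under a few assumptions: the names cats, pattern and result are illustrative, the category names contain no regex metacharacters (they would need escaping otherwise), and no row needs a priority rule. str_extract() returns the leftmost match in each string, so a source listing two categories gets whichever one appears first.

```r
library(dplyr)
library(stringr)

# Same example data as in the question.
df <- data.frame(
  source = c('facebook / cpc', 'googletest', 'adwords.google', 'organic',
             'source / referral', 'google / source', 'facebook / test'),
  global_sources = c('facebook', 'google', NA, 'organic', 'referral', NA, NA),
  stringsAsFactors = FALSE
)

# Known categories, joined into one alternation such as 'facebook|google|...'.
cats <- unique(df$global_sources[!is.na(df$global_sources)])
pattern <- str_c(cats, collapse = '|')

# coalesce() keeps existing values and fills only the NA entries.
result <- df %>%
  mutate(global_sources = coalesce(global_sources, str_extract(source, pattern)))

result
```

Using coalesce() instead of if_else(is.na(...)) is a stylistic choice here; both fill only the missing entries.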
CodePudding user response:
You should be able to do this with a single application of str_detect() when combined with a matching function. This assumes that whenever global_sources is NA, exactly one of the unique global_sources values matches source. If that's not true, you'll probably need something more complex.
library(tidyverse)
df <- data.frame(source = c('facebook / cpc', 'googletest', 'adwords.google', 'organic',
                            'source / referral', 'google / source', 'facebook / test'),
                 global_sources = c('facebook', 'google', NA, 'organic', 'referral', NA, NA))
# pull out unique options for global_sources
g <- unique(na.omit(df$global_sources))
df %>%
  rowwise() %>%
  mutate(global_sources = if_else(
    is.na(global_sources),
    g[str_detect(source, g)],
    global_sources
  )) %>%
  ungroup()
#> # A tibble: 7 × 2
#> source global_sources
#> <chr> <chr>
#> 1 facebook / cpc facebook
#> 2 googletest google
#> 3 adwords.google google
#> 4 organic organic
#> 5 source / referral referral
#> 6 google / source google
#> 7 facebook / test facebook
Created on 2022-11-17 with reprex v2.0.2
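When a source can match more than one category, one possible approach is the for loop the question asks about: walk through the categories in priority order and fill only the entries that are still NA. This is a sketch, not a definitive solution. The priority vector is an assumption (every category in its original order, with referral moved to the end, as the question's note suggests), the names priority and filled are illustrative, and row 6 uses 'google / referral' from the question's tables so that one row really has two candidate matches.

```r
library(stringr)

# Example data; row 6 matches both 'google' and 'referral'.
df <- data.frame(
  source = c('facebook / cpc', 'googletest', 'adwords.google', 'organic',
             'source / referral', 'google / referral', 'facebook / test'),
  global_sources = c('facebook', 'google', NA, 'organic', 'referral', NA, NA),
  stringsAsFactors = FALSE
)

# Known categories, with 'referral' moved last (assumed priority order).
cats <- unique(df$global_sources[!is.na(df$global_sources)])
priority <- c(setdiff(cats, 'referral'), 'referral')

filled <- df$global_sources
for (category in priority) {
  # Fill only rows that are still NA, so earlier (higher-priority) matches win.
  # fixed() matches the category as a literal string rather than as a regex.
  hit <- is.na(filled) & str_detect(df$source, fixed(category))
  filled[hit] <- category
}
df$global_sources <- filled

df
```

Because each pass touches only the remaining NA entries, the order of priority alone decides ties, regardless of where the category appears inside source.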