Home > front end >  Replace values conditionally, better way
Replace values conditionally, better way

Time:05-13

I have a large dataframe with city names and many are misspelled. Right now I have corrected then manually, one by one, using the following code:

geo <- geo %>% 
  mutate(address = ifelse(hqcity == 'SMOERUM', 'Smørum, Denmark', address),
         address = ifelse(hqcity == 'Staefa',  'Stäfa, Switzerland', address),
         address = ifelse(hqcity == 'KOEBENHAVN' ,  'København, Denmark', address),
         address = ifelse(hqcity == 'Hoersholm' ,  'Hørsholm, Denmark', address),
         address = ifelse(hqcity == 'Buedelsdorf' ,  'Büdelsdorf, Germany', address),
         address = ifelse(hqcity == 'PFAEFFIKON' ,  'Pfäffikon, Switzerland', address),
         address = ifelse(hqcity == 'Ruemlang',  'Rümlang, Switzerland', address),
         address = ifelse(hqcity == 'KILKENNY', 'Kilkenny, Ireland', address),
         address = ifelse(hqcity == 'Kingscourt', 'Kingscourt, Ireland', address),
         address = ifelse(hqcity == 'Soeborg', 'Søborg, Denmark', address),
         address = ifelse(hqcity == 'Unterpremstaetten' , 'Unterpremstätten, Austria', address),
         address = ifelse(hqcity == 'KOEBENHAVN S',  'København S, Denmark', address),
         address = ifelse(hqcity == 'Maennedorf', 'Männedorf, Switzerland', address),
         address = ifelse(hqcity == 'Santry' ,  'Santry, Ireland', address),
         address = ifelse(hqcity == 'Tralee',  'Tralee, Ireland', address),
         address = ifelse(hqcity == 'KOEBENHAVN K', 'København K, Denmark', address),
         address = ifelse(hqcity == 'Goeteborg', 'Göteborg, Sweden', address), 
         address = ifelse(hqcity == 'Goeppingen', 'Göppingen, Germany', address)
  )

I want to know if somebody knows (or can imagine) a better way (with less repetitions, shorter, etc.) to make the same replacements. Thanks.

CodePudding user response:

You can use dplyr's case_when like Martin said. Would be something like this:

geo <- geo %>%
        mutate(adress = case_when(
                hqcity == 'SMOERUM' ~ 'Smørum, Denmark',
                hqcity == 'Staefa' ~ 'Stäfa, Switzerland',
                #[...]
                #[...]
                #[...]
                TRUE ~ adress
        ))

I guess its better in terms of processing time but still you need to specific the conditions manually. Unless theres some rule/pattern regulating the changes.

CodePudding user response:

An alternative might be using a lookup vector. Starting with a data.frame like this

library(tibble)

df <- tibble::tribble(~a, ~b,
'SMOERUM', 'Smørum, Denmark',
'Staefa',  'Stäfa, Switzerland',
'KOEBENHAVN' ,  'København, Denmark',
'Hoersholm' ,  'Hørsholm, Denmark',
'Buedelsdorf' ,  'Büdelsdorf, Germany',
'PFAEFFIKON' ,  'Pfäffikon, Switzerland',
'Ruemlang',  'Rümlang, Switzerland',
'KILKENNY', 'Kilkenny, Ireland',
'Kingscourt', 'Kingscourt, Ireland',
'Soeborg', 'Søborg, Denmark',
'Unterpremstaetten' , 'Unterpremstätten, Austria',
'KOEBENHAVN S',  'København S, Denmark',
'Maennedorf', 'Männedorf, Switzerland',
'Santry' ,  'Santry, Ireland',
'Tralee',  'Tralee, Ireland',
'KOEBENHAVN K', 'København K, Denmark',
'Goeteborg', 'Göteborg, Sweden' ,
'Goeppingen', 'Göppingen, Germany')

we create

lookup <- deframe(df)

Now use this vector inside mutate:

library(dplyr)

geo %>%
  mutate(adress = lookup[adress])

CodePudding user response:

A possible solution (my geo dataframe has 3 wrong cases and 1 correct):

library(tidyverse)

df <- data.frame(
  wrong = c("SMOERUM","Staefa",
            "KOEBENHAVN","Hoersholm","Buedelsdorf",
            "PFAEFFIKON","Ruemlang","KILKENNY","Kingscourt",
            "Soeborg","Unterpremstaetten","KOEBENHAVN S",
            "Maennedorf","Santry","Tralee","KOEBENHAVN K",
            "Goeteborg","Goeppingen"),
  correct = c("Smørum,  Denmark",
              "Stäfa,  Switzerland","København,  Denmark",
              "Hørsholm,  Denmark","Büdelsdorf,  Germany",
              "Pfäffikon,  Switzerland","Rümlang,  Switzerland",
              "Kilkenny,  Ireland","Kingscourt,  Ireland",
              "Søborg,  Denmark","Unterpremstätten,  Austria",
              "København S,  Denmark","Männedorf,  Switzerland",
              "Santry,  Ireland","Tralee,  Ireland","København K,  Denmark",
              "Göteborg,  Sweden","Göppingen,  Germany")
)

geo <- data.frame(address = c("Hoersholm", "KILKENNY", 
        "KOEBENHAVN K", "Göppingen,  Germany"))

geo %>%
  rowwise %>%
  mutate(address = ifelse(address %in% df$wrong, 
     filter(df %>% rowwise, address %in% wrong) %>% 
     select(correct) %>% deframe, address)) %>%
  ungroup

#> # A tibble: 4 × 1
#>   address              
#>   <chr>                
#> 1 Hørsholm,  Denmark   
#> 2 Kilkenny,  Ireland   
#> 3 København K,  Denmark
#> 4 Göppingen,  Germany

Another possible solution, based on purrr::map_chr:

library(tidyverse)

geo %>%
  mutate(address = map_chr(address, ~ if (.x %in% df$wrong) 
          {filter(df, .x == wrong) %>% select(correct) %>% deframe} else {.x}))

#>                 address
#> 1    Hørsholm,  Denmark
#> 2    Kilkenny,  Ireland
#> 3 København K,  Denmark
#> 4   Göppingen,  Germany
  • Related