I have a large dataframe with city names and many are misspelled. Right now I have corrected then manually, one by one, using the following code:
geo <- geo %>%
mutate(address = ifelse(hqcity == 'SMOERUM', 'Smørum, Denmark', address),
address = ifelse(hqcity == 'Staefa', 'Stäfa, Switzerland', address),
address = ifelse(hqcity == 'KOEBENHAVN' , 'København, Denmark', address),
address = ifelse(hqcity == 'Hoersholm' , 'Hørsholm, Denmark', address),
address = ifelse(hqcity == 'Buedelsdorf' , 'Büdelsdorf, Germany', address),
address = ifelse(hqcity == 'PFAEFFIKON' , 'Pfäffikon, Switzerland', address),
address = ifelse(hqcity == 'Ruemlang', 'Rümlang, Switzerland', address),
address = ifelse(hqcity == 'KILKENNY', 'Kilkenny, Ireland', address),
address = ifelse(hqcity == 'Kingscourt', 'Kingscourt, Ireland', address),
address = ifelse(hqcity == 'Soeborg', 'Søborg, Denmark', address),
address = ifelse(hqcity == 'Unterpremstaetten' , 'Unterpremstätten, Austria', address),
address = ifelse(hqcity == 'KOEBENHAVN S', 'København S, Denmark', address),
address = ifelse(hqcity == 'Maennedorf', 'Männedorf, Switzerland', address),
address = ifelse(hqcity == 'Santry' , 'Santry, Ireland', address),
address = ifelse(hqcity == 'Tralee', 'Tralee, Ireland', address),
address = ifelse(hqcity == 'KOEBENHAVN K', 'København K, Denmark', address),
address = ifelse(hqcity == 'Goeteborg', 'Göteborg, Sweden', address),
address = ifelse(hqcity == 'Goeppingen', 'Göppingen, Germany', address)
)
I want to know if somebody knows (or can imagine) a better way (with less repetitions, shorter, etc.) to make the same replacements. Thanks.
CodePudding user response:
You can use dplyr's case_when like Martin said. Would be something like this:
geo <- geo %>%
mutate(adress = case_when(
hqcity == 'SMOERUM' ~ 'Smørum, Denmark',
hqcity == 'Staefa' ~ 'Stäfa, Switzerland',
#[...]
#[...]
#[...]
TRUE ~ adress
))
I guess its better in terms of processing time but still you need to specific the conditions manually. Unless theres some rule/pattern regulating the changes.
CodePudding user response:
An alternative might be using a lookup vector. Starting with a data.frame like this
library(tibble)
df <- tibble::tribble(~a, ~b,
'SMOERUM', 'Smørum, Denmark',
'Staefa', 'Stäfa, Switzerland',
'KOEBENHAVN' , 'København, Denmark',
'Hoersholm' , 'Hørsholm, Denmark',
'Buedelsdorf' , 'Büdelsdorf, Germany',
'PFAEFFIKON' , 'Pfäffikon, Switzerland',
'Ruemlang', 'Rümlang, Switzerland',
'KILKENNY', 'Kilkenny, Ireland',
'Kingscourt', 'Kingscourt, Ireland',
'Soeborg', 'Søborg, Denmark',
'Unterpremstaetten' , 'Unterpremstätten, Austria',
'KOEBENHAVN S', 'København S, Denmark',
'Maennedorf', 'Männedorf, Switzerland',
'Santry' , 'Santry, Ireland',
'Tralee', 'Tralee, Ireland',
'KOEBENHAVN K', 'København K, Denmark',
'Goeteborg', 'Göteborg, Sweden' ,
'Goeppingen', 'Göppingen, Germany')
we create
lookup <- deframe(df)
Now use this vector inside mutate
:
library(dplyr)
geo %>%
mutate(adress = lookup[adress])
CodePudding user response:
A possible solution (my geo
dataframe has 3 wrong cases and 1 correct):
library(tidyverse)
df <- data.frame(
wrong = c("SMOERUM","Staefa",
"KOEBENHAVN","Hoersholm","Buedelsdorf",
"PFAEFFIKON","Ruemlang","KILKENNY","Kingscourt",
"Soeborg","Unterpremstaetten","KOEBENHAVN S",
"Maennedorf","Santry","Tralee","KOEBENHAVN K",
"Goeteborg","Goeppingen"),
correct = c("Smørum, Denmark",
"Stäfa, Switzerland","København, Denmark",
"Hørsholm, Denmark","Büdelsdorf, Germany",
"Pfäffikon, Switzerland","Rümlang, Switzerland",
"Kilkenny, Ireland","Kingscourt, Ireland",
"Søborg, Denmark","Unterpremstätten, Austria",
"København S, Denmark","Männedorf, Switzerland",
"Santry, Ireland","Tralee, Ireland","København K, Denmark",
"Göteborg, Sweden","Göppingen, Germany")
)
geo <- data.frame(address = c("Hoersholm", "KILKENNY",
"KOEBENHAVN K", "Göppingen, Germany"))
geo %>%
rowwise %>%
mutate(address = ifelse(address %in% df$wrong,
filter(df %>% rowwise, address %in% wrong) %>%
select(correct) %>% deframe, address)) %>%
ungroup
#> # A tibble: 4 × 1
#> address
#> <chr>
#> 1 Hørsholm, Denmark
#> 2 Kilkenny, Ireland
#> 3 København K, Denmark
#> 4 Göppingen, Germany
Another possible solution, based on purrr::map_chr
:
library(tidyverse)
geo %>%
mutate(address = map_chr(address, ~ if (.x %in% df$wrong)
{filter(df, .x == wrong) %>% select(correct) %>% deframe} else {.x}))
#> address
#> 1 Hørsholm, Denmark
#> 2 Kilkenny, Ireland
#> 3 København K, Denmark
#> 4 Göppingen, Germany