There are many related questions here about this issue, particulalry using left_join
from dplyr
, but I still can't figure it out.
All I want to do is return LanguageClean
in Lookup
based on a match to the Language
column in df
. If there is no match, simply return NA
. I want LanguageClean
added as a new column to df
.
I can see that my code below is replicating the ID
, but I don't want it to. The ID
column is irrelevant for my purposes here, although I need to retain it in the final dataframe.
df <- structure(list(ID = structure(c(18L, 89L, 42L, 161L, 88L, 71L,
175L, 181L, 133L, 56L, 18L, 89L, 42L, 161L, 88L, 71L, 175L, 181L,
133L, 56L, 18L, 89L, 42L, 161L, 88L, 71L, 175L, 181L, 133L, 56L
), .Dim = c(10L, 3L)), Language = c("en", "", "lv", "en", "en",
"de", "en", "ms", "", "en"), Geo = c("us", "", "-", "us",
"us", "gb", "ca", "us", "-", "us")), class = "data.frame", row.names = c(NA,
-10L))
lookup <- structure(list(Language = c("af", "ar", "ar", "ar", "ar", "ar",
"ar", "ar", "ar", "eu", "be", "zh", "zh", "hr", "da", "nl", "en",
"en", "en", "en", "en", "en", "fo", "fi", "fr", "fr", "gd", "de",
"de", "de", "he", "hu", "id", "it", "ko", "lv", "mk", "mt", "no",
"pt", "rm", "ro", "ru", "sr", "sk", "sb", "es", "es", "es", "es",
"es", "es", "es", "es", "es", "sx", "sv", "ts", "tr", "ur", "vi",
"ji", "sq", "ar", "ar", "ar", "ar", "ar", "ar", "ar", "ar", "bg",
"ca", "zh", "zh", "cs", "nl", "en", "en", "en", "en", "en", "en",
"et", "fa", "fr", "fr", "fr", "ga", "de", "de", "el", "hi", "is",
"it", "ja", "ko", "lt", "ms", "no", "pl", "pt", "ro", "ru", "sz",
"sr", "sl", "es", "es", "es", "es", "es", "es", "es", "es", "es",
"es", "sv", "th", "tn", "uk", "ve", "xh", "zu"), LanguageClean = c("Afrikaans",
"Arabic", "Arabic", "Arabic", "Arabic", "Arabic", "Arabic", "Arabic",
"Arabic", "Basque", "Belarusian", "Chinese", "Chinese", "Croatian",
"Danish", "Dutch", "English", "English", "English", "English",
"English", "English", "Faeroese", "Finnish", "French", "French",
"Gaelic", "German", "German", "German", "Hebrew", "Hungarian",
"Indonesian", "Italian", "Korean", "Latvian", "Macedonian", "Maltese",
"Norwegian", "Portuguese", "Rhaeto-Romanic", "Romanian", "Russian",
"Serbian", "Slovak", "Sorbian", "Spanish", "Spanish", "Spanish",
"Spanish", "Spanish", "Spanish", "Spanish", "Spanish", "Spanish",
"Sutu", "Swedish", "Tsonga", "Turkish", "Urdu", "Vietnamese",
"Yiddish", "Albanian", "Arabic", "Arabic", "Arabic", "Arabic",
"Arabic", "Arabic", "Arabic", "Arabic", "Bulgarian", "Catalan",
"Chinese", "Chinese", "Czech", "Dutch", "English", "English",
"English", "English", "English", "English", "Estonian", "Farsi",
"French", "French", "French", "Irish", "German", "German", "Greek",
"Hindi", "Icelandic", "Italian", "Japanese", "Korean", "Lithuanian",
"Malaysian", "Norwegian", "Polish", "Portuguese", "Romanian",
"Russian", "Sami", "Serbian", "Slovenian", "Spanish", "Spanish",
"Spanish", "Spanish", "Spanish", "Spanish", "Spanish", "Spanish",
"Spanish", "Spanish", "Swedish", "Thai", "Tswana", "Ukrainian",
"Venda", "Xhosa", "Zulu")), class = "data.frame", row.names = c(NA,
-124L))
df <- left_join(df, lookup, by="Language")
CodePudding user response:
The issue is that your lookup table contains multiple entries for some languages. Hence you end up with multiple matches. So solve your issue you could filter out the distinct or unique combinations from your lookup using dplyr::distinct
:
library(dplyr)
df <- left_join(df, distinct(lookup, Language, LanguageClean), by = "Language")
df
#> ID.1 ID.2 ID.3 Language Geo LanguageClean
#> 1 18 18 18 en us English
#> 2 89 89 89 <NA>
#> 3 42 42 42 lv - Latvian
#> 4 161 161 161 en us English
#> 5 88 88 88 en us English
#> 6 71 71 71 de gb German
#> 7 175 175 175 en ca English
#> 8 181 181 181 ms us Malaysian
#> 9 133 133 133 - <NA>
#> 10 56 56 56 en us English