I have multiple datasets that I would like to visualize in R. Unfortunately, the nomenclature across datasets is inconsistent or uses synonyms (e.g. "apple" is spelled "apple", "Apple" and "APPLE").
I have a data frame that maps the nomenclature across datasets:
| Name Dataset A | Name Dataset B | Name Dataset C |
|---|---|---|
| Apple | APPLE | apple |
| Pear | PEAR | NA |
| Melon | NA | melon |
I would like to make things consistent, e.g. iterate through datasets B and C and replace their nomenclature with that of dataset A (if available). Would anyone have any recommendations?
Thanks in advance!
CodePudding user response:
If the names differ only in capitalization, you can put the data frames in a list and then apply a function recursively. You can try something like this:
df = data.frame(col1 =c("Apple", "Pear", "Melon"))
df1 = data.frame(col1 =c("APPLE", "PEAR", NA))
df2 = data.frame(col1 =c("apple", NA, "melon"))
dflist = mget(ls(pattern = "^df[0-9]*$")) # Put df, df1 and df2 in a named list (the anchored pattern skips other objects such as dflist)
Then you can apply a function to each element, e.g. convert all the words to lower case using rapply():
thelist = rapply(dflist, tolower, how = "list")
Output
$df
$df$col1
[1] "apple" "pear" "melon"
$df1
$df1$col1
[1] "apple" "pear" NA
$df2
$df2$col1
[1] "apple" NA "melon"
Additional string manipulation can be applied to the list, e.g. searching for a pattern and replacing it using gsub() and lapply():
thelist2 = lapply(thelist, "[[", "col1") |> # Extract "col1" from each data frame
  lapply(\(x) gsub('apple', 'pink lady', x)) # Replace 'apple' with 'pink lady'
Output
$df
[1] "pink lady" "pear" "melon"
$df1
[1] "pink lady" "pear" NA
$df2
[1] "pink lady" NA "melon"
You can take a similar approach using rapply():
thelist3 = rapply(thelist, \(x) gsub('apple', 'pink lady', x), how = 'list')
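One caveat with gsub(): the pattern matches substrings, so a name that merely contains "apple" would be rewritten too. Here is a minimal sketch of the difference, using a made-up vector x (the "pineapple" entry is my own illustration, not part of the question's data):

```r
# Hypothetical names; "pineapple" is only here to show the substring problem
x <- c("apple", "pineapple", NA)

# Unanchored pattern: also rewrites the "apple" inside "pineapple"
loose <- gsub("apple", "pink lady", x)

# Anchored pattern: only replaces elements that are exactly "apple"
exact <- gsub("^apple$", "pink lady", x)

print(loose)
print(exact)
```

If the names should only ever be replaced as a whole, anchoring the pattern with ^ and $ (or using an exact lookup with match()) is probably the safer choice.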
Depending on the structure of your data, you can also join the data frames and then apply the functions as needed.
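As a rough sketch of the join idea, assuming each dataset has a name column plus some values (the data frames dat_a and dat_b, their value columns a and b, and the helper column key are all hypothetical), you could build a lower-case key and merge on it:

```r
# Hypothetical datasets: same fruits, different capitalization
dat_a <- data.frame(name = c("Apple", "Pear", "Melon"), a = 1:3)
dat_b <- data.frame(name = c("APPLE", "PEAR"), b = c(10, 20))

# Build a case-insensitive key in both data frames
dat_a$key <- tolower(dat_a$name)
dat_b$key <- tolower(dat_b$name)

# Left join on the key; only dataset A's spelling of the name is kept
joined <- merge(dat_a, dat_b[c("key", "b")], by = "key", all.x = TRUE)
print(joined)
```

This only works when the names differ in capitalization alone. For real synonyms you would need a lookup table like the one in the question.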
CodePudding user response:
If you have names and data like this:
df_names = data.frame(names_for_a = c("apple", "orange"),
names_for_b = c("pink lady", "ORANGE"))
df_b = data.frame(index = 1:9, name = rep(c("pink lady", "ORANGE", "ORANGE"), 3))
I'd do something like:
df_b$name = df_names$names_for_a[match(df_b$name, df_names$names_for_b)]
This thread may be helpful for what you are doing too: "Replace values in a dataframe based on lookup table".
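Note that a plain match() lookup returns NA for any name that is missing from the lookup table. Since the question asks to replace names only "if available", here is a sketch that keeps the original name in that case. The column names name_a, name_b, name_c, the helper rename_to_a() and the extra "KIWI" row are my own assumptions, not something from the question:

```r
# Lookup table from the question (column names are hypothetical)
lookup <- data.frame(
  name_a = c("Apple", "Pear", "Melon"),
  name_b = c("APPLE", "PEAR", NA),
  name_c = c("apple", NA, "melon")
)

# Hypothetical dataset B; "KIWI" has no entry in the lookup table
dataset_b <- data.frame(index = 1:4, name = c("PEAR", "APPLE", "KIWI", "APPLE"))

# Hypothetical helper: translate x from one naming scheme to another,
# keeping the original value when no translation is available
rename_to_a <- function(x, from, to) {
  idx <- match(x, from)
  idx[is.na(x)] <- NA  # an NA name must not match an NA lookup entry
  ifelse(is.na(idx), x, to[idx])
}

dataset_b$name <- rename_to_a(dataset_b$name, lookup$name_b, lookup$name_a)
print(dataset_b)
```

The same helper could be applied to dataset C by passing lookup$name_c as the from argument.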