R: Replacing values if present in another dataframe

Time:02-23

I have multiple datasets that I would like to visualize in R. Unfortunately, the nomenclature is not consistent across datasets or uses synonyms (e.g. "apple" is spelled "apple", "Apple" and "APPLE").

I have a dataframe that maps the nomenclature across datasets:

Name Dataset A    Name Dataset B    Name Dataset C
Apple             APPLE             apple
Pear              PEAR              NA
Melon             NA                melon

I would like to make things consistent, i.e. iterate through datasets B and C and replace their nomenclature with that of dataset A (if available). Would anyone have any recommendations?

Thanks in advance!

CodePudding user response:

If you only want to modify the capitalization of some characters, perhaps you can convert the data to a list and then apply a function recursively. You can try something like this:

df = data.frame(col1 =c("Apple", "Pear", "Melon"))

df1 = data.frame(col1 =c("APPLE", "PEAR", NA))

df2 = data.frame(col1 =c("apple", NA, "melon"))

dflist = mget(ls(pattern = "^df[0-9]*$")) # Put df, df1 and df2 in a list (anchored so that dflist itself is not matched on a re-run)

Then you can apply a function to each element, e.g., transform all the words to lower case using rapply():

thelist = rapply(dflist, tolower, how = "list")

Output

$df
$df$col1
[1] "apple" "pear"  "melon"


$df1
$df1$col1
[1] "apple" "pear"  NA     


$df2
$df2$col1
[1] "apple" NA      "melon"

Additional string manipulation can be applied to the list, e.g., searching for a pattern and replacing it using gsub() and lapply():

thelist2 = lapply(thelist, "[[", "col1") |> # Extracting "col1"
    lapply(\(x) gsub('apple', 'pink lady', x)) # Replace 'apple' with 'pink lady'

Output

$df
[1] "pink lady" "pear"      "melon"    

$df1
[1] "pink lady" "pear"      NA         

$df2
[1] "pink lady" NA          "melon"  

You can also take a similar approach using rapply():

thelist3 = rapply(thelist, \(x) gsub('apple', 'pink lady', x), how = 'list')

Depending on the structure of your data, you can also join the data frames and then apply the functions as needed.
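As a rough sketch of the join idea, assuming a lookup table with one column per dataset: the names lookup, name_a, name_b and value below are made up for illustration and are not from the question. A left join with merge() attaches the dataset A name to each row of dataset B, and unmatched rows keep their original name.

```r
# Hypothetical lookup table: one row per item, one column per dataset
lookup <- data.frame(name_a = c("Apple", "Pear"),
                     name_b = c("APPLE", "PEAR"),
                     stringsAsFactors = FALSE)

# Hypothetical dataset B; "KIWI" has no entry in the lookup table
df_b <- data.frame(name  = c("APPLE", "PEAR", "APPLE", "KIWI"),
                   value = c(10, 20, 30, 40),
                   stringsAsFactors = FALSE)

# Left join: keep every row of df_b, add name_a where name matches name_b
joined <- merge(df_b, lookup, by.x = "name", by.y = "name_b", all.x = TRUE)

# Use the dataset A name where one was found, otherwise keep the original
joined$name <- ifelse(is.na(joined$name_a), joined$name, joined$name_a)
joined$name_a <- NULL
```

Note that merge() sorts the result by the join column by default, so the row order of joined may differ from that of df_b.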

CodePudding user response:

If you have names and data like this:

df_names = data.frame(names_for_a = c("apple", "orange"),
                      names_for_b = c("pink lady", "ORANGE"))
df_b = data.frame(index = 1:9, name = rep(c("pink lady", "ORANGE", "ORANGE"), 3))

I'd do something like:

df_b$name = df_names$names_for_a[match(df_b$name, df_names$names_for_b)]
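One thing to watch for with the line above: any name in df_b that is not in the lookup table becomes NA. A minimal sketch of a variant that keeps the original value when there is no match, using the table from the question (the column names name_a, name_b, name_c and the contents of df_b are assumptions for this example):

```r
# Lookup table from the question (column names are made up for this sketch)
lookup <- data.frame(name_a = c("Apple", "Pear", "Melon"),
                     name_b = c("APPLE", "PEAR", NA),
                     name_c = c("apple", NA, "melon"),
                     stringsAsFactors = FALSE)

# Hypothetical dataset B; "KIWI" has no entry in the lookup table
df_b <- data.frame(index = 1:4,
                   name  = c("APPLE", "PEAR", "KIWI", "APPLE"),
                   stringsAsFactors = FALSE)

# Position of each name in the B column of the lookup table (NA if absent).
# incomparables = NA stops missing names from matching the NA lookup entries.
idx <- match(df_b$name, lookup$name_b, incomparables = NA)

# Use the dataset A name where a match exists, otherwise keep the original
df_b$name <- ifelse(is.na(idx), df_b$name, lookup$name_a[idx])
```

The same two lines can be repeated for dataset C by matching against lookup$name_c instead.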

This thread may be helpful for what you are doing too: Replace values in a dataframe based on lookup table
