Home > Blockchain >  Replacing words in Dataframe by values in another Dataframe in R
Replacing words in Dataframe by values in another Dataframe in R

Time:03-18

I'm trying to replace string values in a Dataframe by numeric values in another one. Let me create an example:

df1 <- data.frame("Time" = seq(1, 3, 1),
              "Words" = c("dog cat", "bird fish", "elephant shark turtle"))

This is my main Dataframe. It's pretty simple: The first column refers to the Time spent and the second column shows every word registered on that specific time. The other Dataframe is:

df2 <- data.frame("Words" = c("turtle", "bird", "dog", "elephant", "cat", "shark", "fish"),
              "Points" = seq(1, 7, 1))

This Dataframe refers to the Points that each Word accomplishes. My goal is to calculate the median of points on each time period. For instance, when Time = 1, "dog" equals 3 and "cat" equals 5, so the median will be 4.

I'm having a lot of trouble with this. The first thing is how to separate the words within a string. Then, I think that I have to create a loop to search in df2 the values that appears in df1 and substitute them by their values. Finally, I have to reach the median of each Time period to finish the task.

Any ideas?

CodePudding user response:

Using tidy approach

library(tidyverse)
df1 %>%
  mutate(Words = strsplit(df1$Words, " ")) %>% 
  unnest(Words) %>% 
  inner_join(df2, by="Words") %>% 
  group_by(Time) %>% 
  summarize(Score = median(Points))

Using data.table

library(data.table)
setDT(df1)[,.(Words=strsplit(Words, " ")[[1]]),by=Time] %>% 
  .[setDT(df2), on=.(Words)] %>% 
  .[,(Score = median(Points)), by=Time]

Output:

   Time Score
  <dbl> <dbl>
1     1   4  
2     2   4.5
3     3   4  
  • Related