I have two unrelated datasets in R.
Dataset A contains hundreds of words given as responses in an experiment. Some of the words come up several times, there are 25 responses per participant, with 112 participants. It is formatted like this:
ID Duration Gender Age response1 response1_rt response2 response2_rt response3 response3_rt ...
1 2910 2 26 word 3.55 apple 2.89 house 4.21 ...
2 1873 1 20 paper 2.53 hungry 3.25 tent 2.55 ...
... ... ... ... ... ... ... ...
112 3420 2 18 pen 1.83 salad 4.26 family 3.22 ...
Dataset B contains thousands of words which have been rated on various dimensions. It is formatted like this:
Word measure1 measure2 measure3 ...
1 aardvark 6.26 2.21 19.00 ...
2 abalone 5.3 1.59 20.00 ...
... ... ... ... ... ...
8690 paper 5.42 1.62 40 ...
... ... ... ... ... ...
13915 zucchini 6.3 2.36 20 ...
I want to add new columns to Dataset A beside each response, showing the value given to the word in Dataset B measure1, or NA if the word doesn't appear in the dataset...
So for example:
ID Duration Gender Age response1 measure1.r1 response1_rt response2 measure1.r2 response2_rt response3...
1 2910 2 26 word 5.05 3.55 apple NA 2.89 house
2 1873 1 20 paper 5.42 2.53 hungry 3.54 3.25 tent
...
112 3420 2 18 pen 5.63 1.83 salad 6.35 4.26 family
Any advice on how I can do this?
EDIT: Example data
dataset_a <- structure(list(Duration = c("2910", "3515", "5096",
"2702", "2175", "2162"), Gender = c("2", "2", "2", "2", "2", "3"
), Age = c("19", "18", "38", "27", "20", "20"), response1 = c("Clock",
"clock", "clock", "space", "clock", "clock"), response1_rt = c("4.781",
"4.24", "12.351", "3.841", "5.912", "6.658"), response2 = c("Dress",
"human", "cat", "woman", "stick figure", "hood"), response2_rt = c("4.77",
"4.476", "4.214", "4.494", "3.735", "4.998"), response3 = c("2002",
"date", "party", "2022", "calendar", "month"), response3_rt = c("3.557",
"2.82", "7.517", "4.371", "4.868", "4.219")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
dataset_b <- structure(list(Word = c("abandon", "alarm_clock", "calendar", "chip",
"doll", "dress", "fig", "hotel", "human", "kayak", "light", "part",
"string", "woman", "zone"), measure1 = c("6.26",
"5.3", "2.84", "2.63", "5.85", "5.43", "4.48", "2.42", "2.05",
"5.52", "5.57", "4", "5.15", "3.53", "3.05"), measure2 = c(2.21,
1.59, 1.54, 1.74, 1.69, 1.75, 1.59, 1.61, 1.31, 1.75, 1.75, 1.29,
1.79, 1.22, 1.81), measure3 = c(19, 20, 19, 30, 20, 21, 23,
19, 36, 21, 23, 19, 20, 40, 19)), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"), problems = structure(list(expected = c("a double", "1/0/T/F/TRUE/FALSE", "a double",
"1/0/T/F/TRUE/FALSE", "a double", "1/0/T/F/TRUE/FALSE", "a double",
"1/0/T/F/TRUE/FALSE"), actual = c("entering", "12", "auto", "10",
"B", "8", "State", "13")), row.names = c(NA, -8L
), class = c("tbl_df", "tbl", "data.frame")))
CodePudding user response:
Here's a solution in the tidyverse
. To avoid some tortuous parsing, I have left the columns with slightly different names and in a different order from what you depicted. However, the fundamental structure and content of the data is as requested.
Solution
library(tidyverse)
# ...
# Code to generate your 'responses' and 'words' datasets.
# ...
# Filter the desired measures.
words <- words %>%
# Pivot the 'measure*' columns into the two columns:
# - 'm': which measure for that word
# - 'measure': the value measured
pivot_longer(
cols = matches("measure\\d "),
names_to = c(".value", "m"),
names_pattern = "(measure)(\\d )"
) %>%
# Filter down to desired measures by number.
filter(m %in% c(
# You want only measure #1.
1 #, 2, 3, ...
)) %>%
# Pivot back to the original 'measure*' format, but with only the desired
# columns.
pivot_wider(
names_from = m,
values_from = measure,
names_glue = "{.value}{.name}"
)
# Match the measures into the responses.
measured_responses <- responses %>%
# Pivot the 'response*' columns into two columns:
# - 'n': which response for that ID
# - 'response': the word itself
pivot_longer(
cols = matches("^response\\d $"),
names_to = c(".value", "n"),
names_pattern = "^(response)(\\d )$"
) %>%
# Match by word the responses with their measures; and retain (via LEFT join)
# the responses that lack matches.
left_join(
words,
by = c("response" = "Word")
) %>%
# Pivot the dataset back into its original 'response*' format, now with the
# new 'measure*.*' columns included.
pivot_wider(
names_from = n,
values_from = c(response, matches("measure\\d ")),
names_sep = "."
)
# View the results.
measured_responses
Results
Given a responses
dataset like this
responses <- tibble::tribble(
~ID, ~Duration, ~Gender, ~Age, ~response1, ~response1_rt, ~response2, ~response2_rt, ~response3, ~response3_rt, # ...
1, 2910, 2, 26, "word", 3.55, "apple", 2.89, "house", 4.21, # ...
2, 1873, 1, 20, "paper", 2.53, "hungry", 3.25, "tent", 2.55, # ...
112, 3420, 2, 18, "pen", 1.83, "salad", 4.26, "family", 3.22 # ...
)
and a words
dataset like this
words <- tibble::tribble(
~Word, ~measure1, ~measure2, ~measure3, # ...
"apple", 5.19, 1.84, 25.00, # ...
"paper", 5.42, 1.62, 40, # ...
"pen", 5.63, 1.71, 20, # ...
"salad", 6.07, 2.19, 40, # ...
"tent", 4.96, 1.55, 30, # ...
"word", 5.05, 2.26, 19.00, # ...
)
this solution should produce a measured_responses
dataset like this:
# A tibble: 3 x 13
ID Duration Gender Age response1_rt response2_rt response3_rt response.1 response.2 response.3 measure1.1 measure1.2 measure1.3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 2910 2 26 3.55 2.89 4.21 word apple house 5.05 5.19 NA
2 2 1873 1 20 2.53 3.25 2.55 paper hungry tent 5.42 NA 4.96
3 112 3420 2 18 1.83 4.26 3.22 pen salad family 5.63 6.07 NA