Home > Enterprise >  Assigning a value and sum up if two strings match in two different data frames in R
Assigning a value and sum up if two strings match in two different data frames in R

Time:12-22

I have conducted a large survey (consisting of 42 subsurveys for different treatments) and have trouble getting my data in shape.

I have ~ 16 000 answers, each answer (i.e. what is an alternative use for a newspaper) is a cell in a data frame. These answers are in the form of Data 1 (below).

Depending on how many times an answer is given, it is worth 0-6 points (the more points, the less people have thought of it, the more creative the answer). This list is identical in form to Data 2 (listed below).

Now I want to sum each row of each of the 42 surveys (=participant) according to the breakdown in Data 2. This score should be an extra column in the dataframe called "score".

Simple example:

Participant 1 answers: "schuhe", "basteln", ... => score = 1 0 points = 1

Participant 2 answers: "brennmaterial", "schiff", ... => score = 1 1 points = 2

So the code should do:

  1. If the string in cell x of Data 1 (e.g. "schuhe") matches a string in column 1 in Data 2 (here: "schuhe") (it always does, since data 2 is a table created out of data 1), pick the value corresponding to that matching string in data 2 of column "points", keep it in memory or assign to variable.
  2. Go to next cell in row, do step 1.
  3. If all columns of name "mycolumns" are done in one row, sum up the points.
  4. Write the sum of points in column "score" in data 1.
  5. Repeat for next row.

Data 1: The answers given in 1 of 42 surveys (snippet):

structure(list(id = c("1", "2", "3", "4", "7"), kreazeitung_SQ001 = c("fensterglasersatz", 
"dämmmaterial", "klopapier", NA, NA), kreazeitung_SQ002 = c("einwickeln", 
"brennmaterial", "feueranzünder", "putzlappen", "schlagen"), 
    kreazeitung_SQ003 = c("mülleimer", "flieger", "brennmaterial", 
    "brennmaterial", "abdecken"), kreazeitung_SQ004 = c("schuhe", 
    "regenschirm", "basteln", "pappmaschee", "unterlage")), class = "data.frame", row.names = c(NA, 
-5L))

Data 2: Here are the points that each answer is worth, e.g. if in data 1 the answer is "hut", the column "points" in this data tell me that it is worth 0 points, if it is "schuhe", it should be scored as 1.

structure(list(Var1 = c("basteln", "einwickeln", "abdecken", 
"falten", "schlagen", "feueranzünder", "hut", "unterlage", "collage", 
"fliegenklatsche", "geschenkpapier", "pappmaschee", "zerreißen", 
"brennmaterial", "schiff", "schuhe"), Freq = c(57L, 55L, 46L, 
45L, 43L, 42L, 42L, 42L, 41L, 41L, 41L, 41L, 40L, 39L, 39L, 39L
), points = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1)), row.names = c(9L, 
30L, 1L, 42L, 151L, 47L, 81L, 192L, 20L, 53L, 67L, 126L, 211L, 
16L, 150L, 156L), class = "data.frame")

I am happy about all solutions, either base or tidyverse. Unfortunately, this code complexity is over my head, so I would be thrilled about any help!! Thank you!

CodePudding user response:

if I've understood, then the values in d2$Var1 match up to all the column Q's in d1. If so I think this will work, where d1 is data 1 and d2 is data2

# using data.table package for operations
library(data.table)
d1 <- as.data.table(d1)
d2 <- as.data.table(d2)

# convert from wide format to long
d1_long <- melt(d1, id.vars = "id")
# then can use merge operations to pull the points across
d1_long <- merge(d1_long, d2, by.x="value", by.y="Var1", all.x=TRUE)
# lots of missing values in the e.g., so filled with 0
d1_long[is.na(points), points := 0]

# aggregate the scores, by id
scores <- d1_long[, .(score=sum(points)), by=id]
# add them back in to the original data, sort=FALSE preserves order
d1 <- merge(d1, scores, by="id", sort=FALSE)
d1
   id kreazeitung_SQ001 kreazeitung_SQ002 kreazeitung_SQ003
1:  1 fensterglasersatz        einwickeln         mülleimer
2:  2      dämmmaterial     brennmaterial           flieger
3:  3         klopapier     feueranzünder     brennmaterial
4:  4              <NA>        putzlappen     brennmaterial
5:  7              <NA>          schlagen          abdecken
   kreazeitung_SQ004 score
1:            schuhe     1
2:       regenschirm     1
3:           basteln     1
4:       pappmaschee     1
5:         unterlage     0

# to convert back to data.frame
d1.df <- as.data.frame(d1)
  • Related