Count multi-response answers aginst a vector in R-CodePudding

I have a multi-response question from a survey.

The data look like this:

|respondent| friend          |
|----------|-----------------|
| 001      | John, Mary      |
|002       | Sue, John, Peter|

Then, I want to count, for each respondent, how many male and female friends they have. I imagine I need to create separate vectors of male and female names, then check each cell in the friend column against these vectors and count.

Any help is appreciated.

CodePudding user response：

This should be heavily caveated, because many common names are frequently used by different genders. Here I use the genders applied in american social security data in the babynames package as a proxy. Then I merge that with my data and come up with a weighted count based on likelihood. In the dataset, fairly common names including Casey, Riley, Jessie, Jackie, Peyton, Jaime, Kerry, and Quinn are almost evenly split between genders, so in my approach those add about half a female friend and half a male friend, which seems to me the most sensible approach when the name alone doesn't add much information about gender.

library(tidyverse) # using dplyr, tidyr
gender_freq <- babynames::babynames %>%
  filter(year >= 1930) %>% # limiting to people <= 92 y.o.
  count(name, sex, wt = n) %>%
  group_by(name) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

tribble(
  ~respondent, ~friend,
  "001", "John, Mary, Riley",
  "002", "Sue, John, Peter") %>%
  separate_rows(friend, sep = ", ") %>%
  left_join(gender_freq, by = c("friend" = "name")) %>%
  count(respondent, sex, wt = share)


## A tibble: 4 x 3
#  respondent sex       n
#  <chr>      <chr> <dbl>
#1 001        F      1.53
#2 001        M      1.47
#3 002        F      1.00
#4 002        M      2.00

CodePudding user response：

Assuming you have a list that links a name with gender, you can split up your friend column, merge the result with your list and summarise on the gender:

library(tidyverse)
df <- tibble(
  respondent = c('001', '002'), 
  friend = c('John, Mary', 'Sue, John, Peter')
)

names_df <- tibble(
  name = c('John', 'Mary', 'Sue','Peter'),
  gender = c('M', 'F', 'F', 'M')
)

df %>% 
  mutate(friend = strsplit(as.character(friend), ", ")) %>% 
  unnest(friend) %>% 
  left_join(names_df, by = c('friend' = 'name')) %>% 
  group_by(respondent) %>% 
  summarise(male_friends = sum(gender == 'M'), 
            female_friends = sum(gender == 'F'))

resulting in

# A tibble: 2 x 3
  respondent male_friends female_friends
* <chr>             <int>          <int>
1 001                   1              1
2 002                   2              1