Recoding in R and making between group comparisons

Time:10-03

I have a dataframe with a column that has "Yes", "No" and "Maybe" values. Here is a sample of what the dataframe looks like, for context (not the actual data I'm working with, as that is more sensitive):

State       City           Do you like the color Blue?  Yes  Maybe  No
Arizona     Phoenix        Yes                          1    0      0
Arizona     Phoenix        Yes                          1    0      0
Arizona     Phoenix        Maybe                        0    1      0
Arizona     Phoenix        No                           0    0      1
Arizona     Scottsdale     No                           0    0      1
Arizona     Scottsdale     Yes                          1    0      0
Arizona     Scottsdale     Maybe                        0    1      0
California  San Francisco  Yes                          1    0      0
California  San Francisco  No                           0    0      1
California  San Francisco  Maybe                        0    1      0
California  Los Angeles    Yes                          1    0      0
California  Los Angeles    Yes                          1    0      0
California  Los Angeles    No                           0    0      1

This is a two-part question:

  1. I would like to convert the "Yes" and "Maybe" in the "Do you like the color Blue?" column to equal 1 and the "No" to equal 0 (so categorical to numeric) and add it as a separate column.

  2. I also want to make statistical comparisons between states (e.g. the proportion of those who said "No" in California versus in Arizona). I was thinking of subsetting the data set by state and then making the comparisons, but the data set I'm working with has about 15 states. Is there a faster/more efficient way to do so?

CodePudding user response:

The indicator columns (one per answer) can be created easily with model.matrix from base R:

df1[c("Maybe", "No", "Yes")] <- model.matrix(~ df1$Do_you_like_the_color_Blue - 1)
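The model.matrix call yields one indicator column per answer. For the recode asked about in part 1 ("Yes" and "Maybe" as 1, "No" as 0), a minimal base R sketch might look like the following. The toy data frame and the column name Blue_binary are made up for illustration and are not part of the original data.

```r
# Toy data mirroring the shape of the sample in the question (not the real data)
toy <- data.frame(
  State = c("Arizona", "Arizona", "California", "California"),
  Do_you_like_the_color_Blue = c("Yes", "No", "Maybe", "No")
)

# 1 for "Yes" or "Maybe", 0 for "No"; Blue_binary is a hypothetical column name
toy$Blue_binary <- as.integer(
  toy$Do_you_like_the_color_Blue %in% c("Yes", "Maybe")
)

toy$Blue_binary
```

Note that %in% returns FALSE for missing values, so an NA answer would be coded as 0 here; if the real data has missing answers, they probably need separate handling.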

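Or, using the tidyverse, recode the answer and compute the per-state proportion of "No":

library(dplyr)
df1 %>%
    # 1 for "Yes" or "Maybe", 0 for "No"
    mutate(new_col = as.integer(Do_you_like_the_color_Blue != "No")) %>%
    group_by(State) %>%
    # proportion of "No" answers within each state
    mutate(Prop = mean(new_col == 0)) %>%
    ungroup()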

data

df1 <- structure(list(State = c("Arizona", "Arizona", "Arizona", "Arizona", 
"Arizona", "Arizona", "Arizona", "California", "California", 
"California", "California", "California", "California"), City = c("Phoenix", 
"Phoenix", "Phoenix", "Phoenix", "Scottsdale", "Scottsdale", 
"Scottsdale", "San Francisco", "San Francisco", "San Francisco", 
"Los Angeles", "Los Angeles", "Los Angeles"), Do_you_like_the_color_Blue = c("Yes", 
"Yes", "Maybe", "No", "No", "Yes", "Maybe", "Yes", "No", "Maybe", 
"Yes", "Yes", "No"), Yes = c(1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 
0L, 0L, 1L, 1L, 0L), Maybe = c(0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 0L), No = c(0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 
0L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -13L
))
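For the between-state comparison in part 2, subsetting by state should not be necessary: a single cross-tabulation covers all states at once. The following is only a sketch on the sample data, using base R. The choice of Fisher's exact test is an assumption made because the sample counts are tiny, and the variable names are illustrative.

```r
# Rebuild just the two columns needed from the sample data
state  <- rep(c("Arizona", "California"), times = c(7, 6))
answer <- c("Yes", "Yes", "Maybe", "No", "No", "Yes", "Maybe",
            "Yes", "No", "Maybe", "Yes", "Yes", "No")

# Cross-tabulate state against whether the answer was "No"
said_no <- answer == "No"
tab <- table(State = state, SaidNo = said_no)

# Row-wise proportions: share of "No" answers within each state
prop_no <- prop.table(tab, margin = 1)[, "TRUE"]
prop_no

# One overall test of whether the "No" share differs across states.
# Fisher's exact test is an assumption here, chosen for the small counts.
res <- fisher.test(tab)
res$p.value
```

With around 15 states and larger counts, chisq.test(tab) is a common alternative for the overall test, and stats::pairwise.prop.test(x, n), with x the per-state "No" counts and n the per-state totals, could run every pairwise comparison with adjusted p-values. Which test is appropriate depends on the real data.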

CodePudding user response:

To create the columns, we can use the fastDummies package, which usually takes less code than recoding or pivoting wider:

library(fastDummies)
library(dplyr)

# select_columns takes the column name as a string
df %>% dummy_cols(select_columns = "Do you like the color Blue?")