I have a dataframe with a column that has "Yes", "No" and "Maybe" values. Here is a sample of what the dataframe looks like, for context (not the actual data I'm working with, as that is more sensitive):
State | City | Do you like the color Blue? | Yes | Maybe | No |
---|---|---|---|---|---|
Arizona | Phoenix | Yes | 1 | 0 | 0 |
Arizona | Phoenix | Yes | 1 | 0 | 0 |
Arizona | Phoenix | Maybe | 0 | 1 | 0 |
Arizona | Phoenix | No | 0 | 0 | 1 |
Arizona | Scottsdale | No | 0 | 0 | 1 |
Arizona | Scottsdale | Yes | 1 | 0 | 0 |
Arizona | Scottsdale | Maybe | 0 | 1 | 0 |
California | San Francisco | Yes | 1 | 0 | 0 |
California | San Francisco | No | 0 | 0 | 1 |
California | San Francisco | Maybe | 0 | 1 | 0 |
California | Los Angeles | Yes | 1 | 0 | 0 |
California | Los Angeles | Yes | 1 | 0 | 0 |
California | Los Angeles | No | 0 | 0 | 1 |
This is a two-part question:
1. I would like to convert the "Yes" and "Maybe" values in the "Do you like the color Blue?" column to 1 and the "No" values to 0 (so categorical to numeric) and add the result as a separate column.
2. I also want to make between-state statistical comparisons (e.g. the proportion of those who said "No" in California versus in Arizona). I was thinking of subsetting the data set by state and then making the comparisons, but the data set I'm working with has about 15 states. Is there a faster/more efficient way to do so?
CodePudding user response:
We may do this easily with model.matrix from base R:
df1[c("Maybe", "No", "Yes")] <- model.matrix(~ df1$Do_you_like_the_color_Blue - 1)
Or using tidyverse
library(dplyr)
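df1 %>%
  # "Yes" and "Maybe" become 1, "No" becomes 0
  mutate(new_col = as.integer(Do_you_like_the_color_Blue != "No")) %>%
  group_by(State) %>%
  # proportion of "No" answers within each state
  mutate(Prop = mean(new_col == 0)) %>%
  ungroup()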
data
df1 <- structure(list(State = c("Arizona", "Arizona", "Arizona", "Arizona",
"Arizona", "Arizona", "Arizona", "California", "California",
"California", "California", "California", "California"), City = c("Phoenix",
"Phoenix", "Phoenix", "Phoenix", "Scottsdale", "Scottsdale",
"Scottsdale", "San Francisco", "San Francisco", "San Francisco",
"Los Angeles", "Los Angeles", "Los Angeles"), Do_you_like_the_color_Blue = c("Yes",
"Yes", "Maybe", "No", "No", "Yes", "Maybe", "Yes", "No", "Maybe",
"Yes", "Yes", "No"), Yes = c(1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L,
0L, 0L, 1L, 1L, 0L), Maybe = c(0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L), No = c(0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -13L
))
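For the second part (comparing states without subsetting one by one), here is a minimal base R sketch. It assumes the same recoding as above, rebuilds a cut-down copy of the sample data so that it runs on its own, and uses names I made up for illustration (`responses`, `Answer`, `likes_blue`, `prop_no`). The choice of `fisher.test` is also my assumption, picked because the sample counts are small; with real data, `prop.test` or `chisq.test` on the same table may be more appropriate.

```r
# Hypothetical cut-down copy of the sample data (State and answer only)
responses <- data.frame(
  State = rep(c("Arizona", "California"), times = c(7, 6)),
  Answer = c("Yes", "Yes", "Maybe", "No", "No", "Yes", "Maybe",
             "Yes", "No", "Maybe", "Yes", "Yes", "No")
)

# Recode: "Yes"/"Maybe" -> 1, "No" -> 0
responses$likes_blue <- as.integer(responses$Answer != "No")

# Proportion answering "No" in every state at once, no manual subsetting
prop_no <- tapply(responses$likes_blue == 0, responses$State, mean)
print(prop_no)

# State x (said "No" / did not) contingency table
tab <- table(responses$State, said_no = responses$likes_blue == 0)

# Test whether the proportion of "No" differs between states
test_res <- fisher.test(tab)
print(test_res$p.value)
```

The same `tapply` and `table` calls work unchanged with 15 states, since they group by whatever values appear in `State`. For pairwise comparisons between states, `pairwise.prop.test` from the stats package is one option to look at.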
CodePudding user response:
To create the columns, we can use the fastDummies package, which is usually much faster to write than recoding or pivoting wider:
library(fastDummies)
library(dplyr)
# select_columns takes the column name as a character string
df %>% dummy_cols(select_columns = "Do you like the color Blue?")