Home > Software engineering >  How to group data using two columns in r
How to group data using two columns in r

Time:09-27

I have sample data

s_name <- c("AL", "AL", "CA", "CA", "WI", "WI", "NJ", "NJ", "UT", "UT")
n_unit <- c(40, 30, 150, 110, 45, 80, 70, 40, 50, 90)
li_unit <- c(30, 30, 70, 40, 15, 80, 50, 40, 45, 45)
pv_lvl <- c("High", "Very High", "Medium", "Low", "Very Low", "Medium", "Very High", "low", "Very Low", "High")

mydata <- as.data.frame(cbind(s_name, n_unit, li_unit, pv_lvl))
mydata$n_unit <- as.numeric(mydata$n_unit)
mydata$li_unit <- as.numeric(mydata$li_unit)

mydata$per_li = mydata$li_unit/mydata$n_unit*100

print(mydata)

What I am trying to generate is a table that shows the percent of li_unit in each type of pv_lvl grouped by s_name.
Something like this:

s_name Very Low    Low    Medium   High    Very High
AL     0.00        0.00   0.00     75.00   100.00 
CA     0.00        36.36  46.67    0.00    0.00
WI     33.33       0.00   100.00   0.00    0.00 
NJ     100.00      0.00   0.00     0.00    71.43
UT     90.00       0.00   0.00     50.00   0.00

I tried the group_by function but did not get the desired result.
Thanks for your time and help!

CodePudding user response:

Using reshape2

library(reshape2)
dcast(mydata,s_name~pv_lvl,value.var="per_li",mean)

  s_name High low      Low    Medium Very High Very Low
1     AL   75 NaN      NaN       NaN 100.00000      NaN
2     CA  NaN NaN 36.36364  46.66667       NaN      NaN
3     NJ  NaN 100      NaN       NaN  71.42857      NaN
4     UT   50 NaN      NaN       NaN       NaN 90.00000
5     WI  NaN NaN      NaN 100.00000       NaN 33.33333

Note: you should fix some pv_lvl values (lower/uppercase), also you can factor it to have it in a specific order.

CodePudding user response:

Here's the standard pivot_wider way:

library(tidyr)
mydata$pv_lvl <- factor(mydata$pv_lvl, levels = c("Very Low", "Low", "Medium", "High", "Very High"))

pivot_wider(mydata, s_name, 
            names_from = pv_lvl, 
            values_from = per_li, 
            values_fill = 0,
            names_sort = TRUE)

output

# A tibble: 5 × 6
  s_name `Very Low`   Low Medium  High `Very High`
  <chr>       <dbl> <dbl>  <dbl> <dbl>       <dbl>
1 AL            0     0      0      75       100  
2 CA            0    36.4   46.7     0         0  
3 WI           33.3   0    100       0         0  
4 NJ            0   100      0       0        71.4
5 UT           90     0      0      50         0  

Note that you don't have to have to use cbind to create the dataframe, mydata <- data.frame(s_name, n_unit, li_unit, pv_lvl) is enough. Also, I created a factor out of pv_lvl so that you can sort them as desired using names_sort in pivot_wider.

  •  Tags:  
  • r
  • Related