Creating counts for continuous variable to be grouped by another variable-CodePudding

I have a dataset of 65000 rows. Here is a glimpse of the same for explanation purposes.

state <- c("TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "WI", "WI", "WI", "WI", "WI")

centrac <- c("TX01", "TX02", "TX03", "TX04", "TX05", "TX06", "TX07", "TX08", "CA01", "CA02", "CA03", "CA04", "CA05", "CA06", "CA07", "WI01", "WI02", "WI03", "WI04", "WI05")

pov <- c(48.00, 35.00, 7.04, 18.24, 28.45, 43.76, 3.91, 68.47, 5.29, 11.82, 27.89, 30.13, 17.63, 7.71, 3.45, 56.38, 2.17, 21.09, 23.56, 45.67)

df <- data.frame(state, centrac, pov)
print(df)

   state    centrac   pov
1     TX    TX01      48.00
2     TX    TX02      35.00
3     TX    TX03       7.04
4     TX    TX04      18.24
5     TX    TX05      28.45
6     TX    TX06      43.76
7     TX    TX07       3.91
8     TX    TX08      68.47
9     CA    CA01       5.29
10    CA    CA02      11.82
11    CA    CA03      27.89
12    CA    CA04      30.13
13    CA    CA05      17.63
14    CA    CA06       7.71
15    CA    CA07       3.45
16    WI    WI01      56.38
17    WI    WI02       2.17
18    WI    WI03      21.09
19    WI    WI04      23.56
20    WI    WI05      45.67

First, I need to find out the number of centrac by counts for pov in 5 ranges:

Less than 10 (Very Low Poverty)

10-20 (Low Poverty)

20-30 (Medium Poverty)

30-40 (High Poverty)

More than 40 (Very High Poverty)

And then group these counts by state, so that I get the following table:

Number of Census Tracts by Poverty

State     Very Low Poverty    Low Poverty   Medium Poverty   High Poverty   Very High Poverty
            (Less than 10)       (10-20)        (20-30)         (30-40)       (More than 40)
TX                2                 1             1                 1               3
CA                3                 2             1                 1               0
WI                1                 0             2                 0               2

I am able to use group_by to get the data arranged by state, however I am stuck at the first part.

Much appreciate any help!

CodePudding user response：

You can use cut() to create bins from a continuous variable.

df[["pov_level"]] <- cut(
  pov, 
  breaks = c(-Inf, 10, 20, 30, 40, Inf),
  labels = c(
    "Less than 10 (Very Low Poverty)",
    "10-20 (Low Poverty)",
    "20-30 (Medium Poverty)",
    "30-40 (High Poverty)",
    "More than 40 (Very High Poverty)"
  )
)

table(df$state, df$pov_level)
#    Less than 10 (Very Low Poverty) 10-19 (Low Poverty) 20-29 (Medium Poverty)
# CA                               3                   2                      1
# TX                               2                   1                      1
# WI                               1                   0                      2
# 
#    30-39 (High Poverty) 40 or more (Very High Poverty)
# CA                    1                              0
# TX                    1                              3
# WI                    0                              2

CodePudding user response：

Still by cut(), another option to take count(aggregate()):

state <- c("TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "WI", "WI", "WI", "WI", "WI")

centrac <- c("TX01", "TX02", "TX03", "TX04", "TX05", "TX06", "TX07", "TX08", "CA01", "CA02", "CA03", "CA04", "CA05", "CA06", "CA07", "WI01", "WI02", "WI03", "WI04", "WI05")

pov <- c(48.00, 35.00, 7.04, 18.24, 28.45, 43.76, 3.91, 68.47, 5.29, 11.82, 27.89, 30.13, 17.63, 7.71, 3.45, 56.38, 2.17, 21.09, 23.56, 45.67)

df <- data.frame(state, centrac, pov)

cut_breaks = breaks = c(0,10,20,30,40,Inf)

cut_labels = c("Very Low Poverty","Low Poverty","Medium Poverty","High Poverty","Very High Poverty")

df$pov_cut = cut(df$pov,breaks = cut_breaks,labels = cut_labels, include.lowest = TRUE, right = FALSE)

df_povCutCount = aggregate(x = list(pov_cut_count = df$pov),
                           by = list(pov_cut = df$pov_cut, state = df$state),
                           FUN = "length")

head(df_povCutCount)
    
#               pov_cut state pov_cut_count
#    1 Very Low Poverty    CA             3
#    2      Low Poverty    CA             2
#    3   Medium Poverty    CA             1
#    4     High Poverty    CA             1
#    5 Very Low Poverty    TX             2
#    6      Low Poverty    TX             1