I have a dataset of 65000 rows. Here is a glimpse of the same for explanation purposes.
state <- c("TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "WI", "WI", "WI", "WI", "WI")
centrac <- c("TX01", "TX02", "TX03", "TX04", "TX05", "TX06", "TX07", "TX08", "CA01", "CA02", "CA03", "CA04", "CA05", "CA06", "CA07", "WI01", "WI02", "WI03", "WI04", "WI05")
pov <- c(48.00, 35.00, 7.04, 18.24, 28.45, 43.76, 3.91, 68.47, 5.29, 11.82, 27.89, 30.13, 17.63, 7.71, 3.45, 56.38, 2.17, 21.09, 23.56, 45.67)
df <- data.frame(state, centrac, pov)
print(df)
state centrac pov
1 TX TX01 48.00
2 TX TX02 35.00
3 TX TX03 7.04
4 TX TX04 18.24
5 TX TX05 28.45
6 TX TX06 43.76
7 TX TX07 3.91
8 TX TX08 68.47
9 CA CA01 5.29
10 CA CA02 11.82
11 CA CA03 27.89
12 CA CA04 30.13
13 CA CA05 17.63
14 CA CA06 7.71
15 CA CA07 3.45
16 WI WI01 56.38
17 WI WI02 2.17
18 WI WI03 21.09
19 WI WI04 23.56
20 WI WI05 45.67
First, I need to find out the number of centrac
by counts for pov
in 5 ranges:
Less than 10 (Very Low Poverty)
10-20 (Low Poverty)
20-30 (Medium Poverty)
30-40 (High Poverty)
More than 40 (Very High Poverty)
And then group these counts by state
, so that I get the following table:
Number of Census Tracts by Poverty
State Very Low Poverty Low Poverty Medium Poverty High Poverty Very High Poverty
(Less than 10) (10-20) (20-30) (30-40) (More than 40)
TX 2 1 1 1 3
CA 3 2 1 1 0
WI 1 0 2 0 2
I am able to use group_by
to get the data arranged by state
, however I am stuck at the first part.
Much appreciate any help!
CodePudding user response:
You can use cut()
to create bins from a continuous variable.
df[["pov_level"]] <- cut(
pov,
breaks = c(-Inf, 10, 20, 30, 40, Inf),
labels = c(
"Less than 10 (Very Low Poverty)",
"10-20 (Low Poverty)",
"20-30 (Medium Poverty)",
"30-40 (High Poverty)",
"More than 40 (Very High Poverty)"
)
)
table(df$state, df$pov_level)
# Less than 10 (Very Low Poverty) 10-19 (Low Poverty) 20-29 (Medium Poverty)
# CA 3 2 1
# TX 2 1 1
# WI 1 0 2
#
# 30-39 (High Poverty) 40 or more (Very High Poverty)
# CA 1 0
# TX 1 3
# WI 0 2
CodePudding user response:
Still by cut(), another option to take count(aggregate()):
state <- c("TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "WI", "WI", "WI", "WI", "WI")
centrac <- c("TX01", "TX02", "TX03", "TX04", "TX05", "TX06", "TX07", "TX08", "CA01", "CA02", "CA03", "CA04", "CA05", "CA06", "CA07", "WI01", "WI02", "WI03", "WI04", "WI05")
pov <- c(48.00, 35.00, 7.04, 18.24, 28.45, 43.76, 3.91, 68.47, 5.29, 11.82, 27.89, 30.13, 17.63, 7.71, 3.45, 56.38, 2.17, 21.09, 23.56, 45.67)
df <- data.frame(state, centrac, pov)
cut_breaks = breaks = c(0,10,20,30,40,Inf)
cut_labels = c("Very Low Poverty","Low Poverty","Medium Poverty","High Poverty","Very High Poverty")
df$pov_cut = cut(df$pov,breaks = cut_breaks,labels = cut_labels, include.lowest = TRUE, right = FALSE)
df_povCutCount = aggregate(x = list(pov_cut_count = df$pov),
by = list(pov_cut = df$pov_cut, state = df$state),
FUN = "length")
head(df_povCutCount)
# pov_cut state pov_cut_count
# 1 Very Low Poverty CA 3
# 2 Low Poverty CA 2
# 3 Medium Poverty CA 1
# 4 High Poverty CA 1
# 5 Very Low Poverty TX 2
# 6 Low Poverty TX 1