I have data with several infrequent levels in each column. Here is the dput()
output of a small portion of data.
data <- structure(list(AppointmentMonth = structure(c(9L, 6L, 9L, 12L,
3L, 5L, 5L, 5L, 7L, 10L, 9L, 12L, 7L, 3L, 11L, 9L, 11L, 12L,
12L, 7L, 1L, 6L, 7L, 12L, 1L, 3L, 11L, 4L, 9L, 4L), levels = c("Jan",
"Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct",
"Nov", "Dec"), class = c("ordered", "factor")), DayofWeek = structure(c(2L,
3L, 2L, 5L, 3L, 6L, 3L, 3L, 3L, 5L, 4L, 2L, 2L, 4L, 2L, 5L, 3L,
2L, 4L, 3L, 4L, 6L, 6L, 5L, 2L, 2L, 3L, 2L, 3L, 5L), levels = c("Sun",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), class = c("ordered",
"factor")), AppointmentHour = c(16L, 13L, 14L, 14L, 11L, 14L,
11L, 9L, 9L, 11L, 12L, 10L, 16L, 15L, 8L, 8L, 11L, 8L, 14L, 8L,
16L, 9L, 14L, 14L, 13L, 9L, 10L, 14L, 17L, 14L), EncounterType = structure(c(`Office Visit` = 1L,
`Office Visit` = 1L, `Procedure Visit` = 2L, `Office Visit` = 1L,
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L,
`Office Visit` = 1L, `Office Visit` = 1L, Appointment = 3L, `Office Visit` = 1L,
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L,
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L,
`Procedure Visit` = 2L, `Office Visit` = 1L, `Office Visit` = 1L,
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L,
`Procedure Visit` = 2L, `Office Visit` = 1L, `Office Visit` = 1L,
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L,
`Office Visit` = 1L), levels = c("Office Visit", "Procedure Visit",
"Appointment", "Treatment", "Telemedicine"), class = "factor")), row.names = c(NA,
30L), class = "data.frame")
When you check the frequency table of each column you will realize some levels have 0, 1, or 2 occurrences. Here is an example.
table(data$AppointmentHour)
8 9 10 11 12 13 14 15 16 17
4 4 2 4 1 2 8 1 3 1
I would like to identify and remove the infrequent levels whose frequencies, let's say less than 3 (can be changed depending on the problem/data), in each column. I tried @akrun's answer in this question. Here is the code chunk:
library(data.table)
setDT(data)[data[, .I[.N >= 3], by = .(AppointmentMonth, DayofWeek, AppointmentHour, EncounterType)]$V1]
However, this code removes the infrequent levels based on the combination of columns, not based on each individual column.
CodePudding user response:
data |>
dplyr::add_count(
AppointmentHour
) |>
dplyr::filter(
n < 3
)
#> AppointmentMonth DayofWeek AppointmentHour EncounterType n
#> 1 Jun Tue 13 Office Visit 2
#> 2 Sep Wed 12 Office Visit 1
#> 3 Dec Mon 10 Office Visit 2
#> 4 Mar Wed 15 Office Visit 1
#> 5 Jan Mon 13 Office Visit 2
#> 6 Nov Tue 10 Office Visit 2
#> 7 Sep Tue 17 Office Visit 1
Created on 2022-11-08 with reprex v2.0.2
CodePudding user response:
library(dplyr)
library(rlang)
library(purrr)
col.tally <- function(dat, col) {
temp <- paste0(col, "_count")
dat %>%
group_by(!!col) %>%
mutate(!!temp := n()) %>%
ungroup()
}
map(names(data), ~col.tally(data, sym(.x))) %>%
reduce(left_join, by = names(data)) %>%
filter(across(contains("_count"), ~.x >= 3)) %>%
select(names(data))
#> # A tibble: 18 x 4
#> AppointmentMonth DayofWeek AppointmentHour EncounterType
#> <ord> <ord> <int> <fct>
#> 1 Sep Mon 16 Office Visit
#> 2 Sep Mon 14 Procedure Visit
#> 3 Dec Thu 14 Office Visit
#> 4 Mar Tue 11 Office Visit
#> 5 May Fri 14 Office Visit
#> 6 May Tue 11 Office Visit
#> 7 May Tue 9 Office Visit
#> 8 Jul Tue 9 Office Visit
#> 9 Jul Mon 16 Office Visit
#> 10 Nov Mon 8 Office Visit
#> 11 Sep Thu 8 Office Visit
#> 12 Nov Tue 11 Office Visit
#> 13 Dec Mon 8 Procedure Visit
#> 14 Dec Wed 14 Office Visit
#> 15 Jul Tue 8 Office Visit
#> 16 Jul Fri 14 Office Visit
#> 17 Dec Thu 14 Procedure Visit
#> 18 Mar Mon 9 Office Visit