Home > Software engineering >  Remove infrequent rows for the count of values in EACH column
Remove infrequent rows for the count of values in EACH column

Time:11-09

I have data with several infrequent levels in each column. Here is the dput() output of a small portion of data.

data <- structure(list(AppointmentMonth = structure(c(9L, 6L, 9L, 12L, 
3L, 5L, 5L, 5L, 7L, 10L, 9L, 12L, 7L, 3L, 11L, 9L, 11L, 12L, 
12L, 7L, 1L, 6L, 7L, 12L, 1L, 3L, 11L, 4L, 9L, 4L), levels = c("Jan", 
"Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", 
"Nov", "Dec"), class = c("ordered", "factor")), DayofWeek = structure(c(2L, 
3L, 2L, 5L, 3L, 6L, 3L, 3L, 3L, 5L, 4L, 2L, 2L, 4L, 2L, 5L, 3L, 
2L, 4L, 3L, 4L, 6L, 6L, 5L, 2L, 2L, 3L, 2L, 3L, 5L), levels = c("Sun", 
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), class = c("ordered", 
"factor")), AppointmentHour = c(16L, 13L, 14L, 14L, 11L, 14L, 
11L, 9L, 9L, 11L, 12L, 10L, 16L, 15L, 8L, 8L, 11L, 8L, 14L, 8L, 
16L, 9L, 14L, 14L, 13L, 9L, 10L, 14L, 17L, 14L), EncounterType = structure(c(`Office Visit` = 1L, 
`Office Visit` = 1L, `Procedure Visit` = 2L, `Office Visit` = 1L, 
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Office Visit` = 1L, `Office Visit` = 1L, Appointment = 3L, `Office Visit` = 1L, 
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Procedure Visit` = 2L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Procedure Visit` = 2L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Office Visit` = 1L), levels = c("Office Visit", "Procedure Visit", 
"Appointment", "Treatment", "Telemedicine"), class = "factor")), row.names = c(NA, 
30L), class = "data.frame")

When you check the frequency table of each column you will realize some levels have 0, 1, or 2 occurrences. Here is an example.

table(data$AppointmentHour)
8  9 10 11 12 13 14 15 16 17 
4  4  2  4  1  2  8  1  3  1 

I would like to identify and remove the infrequent levels whose frequencies, let's say less than 3 (can be changed depending on the problem/data), in each column. I tried @akrun's answer in this question. Here is the code chunk:

library(data.table)
setDT(data)[data[, .I[.N >= 3], by = .(AppointmentMonth, DayofWeek, AppointmentHour, EncounterType)]$V1]

However, this code removes the infrequent levels based on the combination of columns, not based on each individual column.

CodePudding user response:

data |>
  dplyr::add_count(
    AppointmentHour
  ) |>
  dplyr::filter(
    n < 3
  )
#>   AppointmentMonth DayofWeek AppointmentHour EncounterType n
#> 1              Jun       Tue              13  Office Visit 2
#> 2              Sep       Wed              12  Office Visit 1
#> 3              Dec       Mon              10  Office Visit 2
#> 4              Mar       Wed              15  Office Visit 1
#> 5              Jan       Mon              13  Office Visit 2
#> 6              Nov       Tue              10  Office Visit 2
#> 7              Sep       Tue              17  Office Visit 1

Created on 2022-11-08 with reprex v2.0.2

CodePudding user response:

library(dplyr)
library(rlang)
library(purrr)

col.tally <- function(dat, col) {
  temp <- paste0(col, "_count")
  dat %>% 
    group_by(!!col) %>% 
    mutate(!!temp := n()) %>% 
    ungroup()
}


map(names(data), ~col.tally(data, sym(.x))) %>% 
  reduce(left_join, by = names(data)) %>% 
  filter(across(contains("_count"), ~.x >= 3)) %>% 
  select(names(data))
#> # A tibble: 18 x 4
#>    AppointmentMonth DayofWeek AppointmentHour EncounterType  
#>    <ord>            <ord>               <int> <fct>          
#>  1 Sep              Mon                    16 Office Visit   
#>  2 Sep              Mon                    14 Procedure Visit
#>  3 Dec              Thu                    14 Office Visit   
#>  4 Mar              Tue                    11 Office Visit   
#>  5 May              Fri                    14 Office Visit   
#>  6 May              Tue                    11 Office Visit   
#>  7 May              Tue                     9 Office Visit   
#>  8 Jul              Tue                     9 Office Visit   
#>  9 Jul              Mon                    16 Office Visit   
#> 10 Nov              Mon                     8 Office Visit   
#> 11 Sep              Thu                     8 Office Visit   
#> 12 Nov              Tue                    11 Office Visit   
#> 13 Dec              Mon                     8 Procedure Visit
#> 14 Dec              Wed                    14 Office Visit   
#> 15 Jul              Tue                     8 Office Visit   
#> 16 Jul              Fri                    14 Office Visit   
#> 17 Dec              Thu                    14 Procedure Visit
#> 18 Mar              Mon                     9 Office Visit
  • Related