Home > database >  Counting number of rows with values greater or equal to each value in a sequence
Counting number of rows with values greater or equal to each value in a sequence

Time:10-19

I'm trying to count the number of rows containing a value greater or equal to a number in a defined sequence and grouped on a second variable. For example the number of rows with values greater or equal to 300, 400, 500 grouped by Company A, Company B and Company C found in another column. In Excel I would just use the COUNTIFS function but I'd like not having to go via Excel for this task.

Example dataset and desired result:

library(tidyverse)

df <- tibble::tribble(
        ~Company, ~Sales,
             "B",   902L,
             "B",   575L,
             "C",   194L,
             "C",   215L,
             "A",   515L,
             "B",   728L,
             "A",   910L,
             "C",   889L,
             "A",   854L,
             "B",   230L,
             "C",   188L,
             "C",   442L,
             "A",   174L,
             "A",   723L,
             "B",   904L,
             "A",   761L,
             "B",   987L,
             "B",   521L,
             "B",   694L,
             "B",   530L,
             "C",   165L,
             "A",   507L,
             "B",   316L,
             "A",   452L,
             "A",   342L,
             "B",   413L,
             "B",   121L,
             "A",   650L,
             "B",   801L,
             "C",   100L
        )

result <- tibble::tribble(
             ~Company, ~Greater.or.equal.to, ~Count,
                  "A",                 300L,     9L,
                  "A",                 400L,     8L,
                  "A",                 500L,     7L,
                  "A",                 600L,     5L,
                  "A",                 700L,     4L,
                  "A",                 800L,     2L,
                  "A",                 900L,     1L,
                  "A",                1000L,     0L,
                  "B",                 300L,    11L,
                  "B",                 400L,    10L,
                  "B",                 500L,     9L,
                  "B",                 600L,     6L,
                  "B",                 700L,     5L,
                  "B",                 800L,     4L,
                  "B",                 900L,     3L,
                  "B",                1000L,     0L,
                  "C",                 300L,     2L,
                  "C",                 400L,     2L,
                  "C",                 500L,     1L,
                  "C",                 600L,     1L,
                  "C",                 700L,     1L,
                  "C",                 800L,     1L,
                  "C",                 900L,     0L,
                  "C",                1000L,     0L
             )

I know how to find individual rows using either base R or dplyr (much more familiar with the Tidyverse) but haven't been able to figure out a way to check for a sequence of values. I've tried creating a for loop in the hope of getting the right answer but is clearly doing something wrong.

# These two versions work but is inefficient for when we have a long sequence of variables to check against
length(which(df$Company == "A" & df$Sales >= 300))

df %>% 
  group_by(Company) %>% 
  summarise(count = sum(Sales >= 300))

# Attempt at a loop
# Sequence of values to loop over. The number sequence can change as the column we're checking
# against are changing
sequence <- seq(300, 1000, 100)
companies <- c("A", "B", "C")

counting <- function(data, col1, col2, range1, range2){
  for (i in range1){
    for (j in range2){
      length(which(data$col1 == i & data$col2 >= j))
    }
  }
}

counting(df, Company, Sales, companies, sequence)

Any advice is greatly appreciated!

CodePudding user response:

We may loop over the seq from 300 to 1000 by 100, filter the data after grouping by 'Company' using the looped value, create a summarised column with number of rows (n()), bind the list elements and use complete to fill the missing combinations with '0' for 'Count' column

library(dplyr)
library(purrr)
library(tidyr)
out <- map(seq(300, 1000, by = 100), ~ 
      df %>%
        group_by(Company) %>% 
        filter(Sales >= .x) %>% 
       summarise(Greater.or.equal.to = .x, Count = n())) %>%
     bind_rows %>%
     complete(Company, Greater.or.equal.to = seq(300, 1000,
               by = 100), fill = list(Count = 0))

-output

out
# A tibble: 24 × 3
   Company Greater.or.equal.to Count
   <chr>                 <dbl> <dbl>
 1 A                       300     9
 2 A                       400     8
 3 A                       500     7
 4 A                       600     5
 5 A                       700     4
 6 A                       800     2
 7 A                       900     1
 8 A                      1000     0
 9 B                       300    11
10 B                       400    10
# … with 14 more rows
> all.equal(out, result)
[1] TRUE
  • Related