Home > Software engineering >  Count Number of Rows in a Dataframe that Match Dynamic Conditions
Count Number of Rows in a Dataframe that Match Dynamic Conditions

Time:05-29

I have a dataframe like this:

ID <- c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2")
year <- c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
df <- data.frame(ID, year)
df
    ID year
1  AB1 2005
2  AB1 2008
3  CD2 2005
4  AB3 2010
5  KK4 2007
6  AB3 2009
7  AB3 2009
8  AB1 2007
9  AB1 2000
10 CD2 2010

I want to add a column xp with the number of rows that match ID and have a smaller value of year than the current row. I am looking for something like that:

df$xp <- nrow( ID == "ID in current row" & year < "year in current row" )

The result should be:

    ID year xp
1  AB1 2005  1
2  AB1 2008  3
3  CD2 2005  0
4  AB3 2010  2
5  KK4 2007  0
6  AB3 2009  0
7  AB3 2009  0
8  AB1 2007  2
9  AB1 2000  0
10 CD2 2010  1

CodePudding user response:

I'm sure there are more concise base R or data.table approaches, but here's an approach using dplyr and tidyr. This approach relies on a "non-equi join", which dplyr does not currently include (but which data.table and sqldf do), so instead I'm doing a cartesian join and then filtering, which will be less efficient for large data.

library(dplyr);library(tidyr)

left_join(                        # join...
  df,                             # each row of df...
  df %>%                          # with each matching row of a table where...
    left_join(df, by = "ID") %>%  # each row of df is joined to all the rows with same ID
    filter(year.y < year.x) %>%   # and we only keep preceding years
    count(ID, year = year.x),     # and we count how many there are per ID
  by = c("ID", "year")) %>%
  replace_na(list(n=0))           # and we replace the NA's with zeroes

CodePudding user response:

Here is an approach using dplyr and purrr:

library(dplyr)
library(purrr)

df %>%
    group_by(ID) %>%
    mutate(xp = map_int(year, function(x) sum(cur_data()$year < x)))

purrr::map_int runs the anonymous function for all elements of the year column. dplyr::cur_data() returns the data of the current group as a data frame.

CodePudding user response:

Here is a data.table solution:

library(data.table)
setDT(df)
df[, xp:=sapply(1:.N, \(x) sum(year < year[x])), by=ID][]
#>      ID year xp
#>  1: AB1 2005  1
#>  2: AB1 2008  3
#>  3: CD2 2005  0
#>  4: AB3 2010  2
#>  5: KK4 2007  0
#>  6: AB3 2009  0
#>  7: AB3 2009  0
#>  8: AB1 2007  2
#>  9: AB1 2000  0
#> 10: CD2 2010  1
  • Related