Home > Blockchain >  Create a presence-absence matrix with presence on specific dates
Create a presence-absence matrix with presence on specific dates

Time:02-16

I want to create a presence/absence matrix that has the date on the y-axes and individuals on the x-axes. When an individual was present on the particular date, the output should be 1, while if it was absent the output should be 0. I have a dataframe with the names of individuals and the dates that they are present in a group:

ID Start End
Afr 2015-06-29 2016-02-16
Ahe 2016-12-18 2018-02-24
Art 2015-07-01 2020-04-30
...

In total I have over a thousand individuals and their dates.

I also have a list/dataframe that contains all the dates from 2015-01-01 to 2021-31-12.

My output data needs to look like this:

Date Afr Ahe Art ...
2015-07-01 1 0 0 ...
2015-07-02 1 0 1 ...
2015-07-03 1 0 1 ...
...

Where the output is 1 when an individual was present in the group at that time and a 0 when it was not. I feel like there should be an easy solution for creating this but so far I have not managed. One of the problems I am encountering is that the list of dates is for example longer than the dataframe with the individuals, making a dcast function for example impossible.

Any help or suggestions would be greatly appreciated! Please also let me know if I should provide more code/background.

Thank you very much!

CodePudding user response:

We can try the code below

library(data.table)

setDT(df1)
setDT(df2)

na.omit(
  dcast(
    df1[df2, .(Date, ID), on = .(Start < Date, End > Date)][df1, on = .(ID)],
    Date ~ ID,
    fun.aggregate = length
  )
)

which gives

         Date Afr Ahe Art
1: 2015-07-01   1   0   0
2: 2015-07-02   1   0   1
3: 2015-07-03   1   0   1

Data

> dput(df1)
structure(list(ID = c("Afr", "Ahe", "Art"), Start = structure(c(16615, 
17153, 16617), class = "Date"), End = structure(c(16847, 17586,
18382), class = "Date")), class = "data.frame", row.names = c(NA,
-3L))

> dput(df2)
structure(list(Date = structure(c(16617, 16618, 16619), class = "Date")), class = "data.frame", row.names = c(NA,
-3L))
  • Related