Home > Blockchain >  Finding which IDs do not have a full year worth of data
Finding which IDs do not have a full year worth of data

Time:05-26

I have a very large dataset (20 years of telemetry data) and I am trying to understand if I am missing any days worth of data.

Using the example dataframe below:

dates <- rep(seq(as.Date("2003/01/01"), as.Date("2006/12/31"), "days"), each=4)
id <- rep(1:4, times=length(unique(dates)))
df <- data.frame(dates=dates, id=id)

This dataframe should be complete in that it has all the dates present, so if I remove some:

df <- df[-4, ]
df <- df[-6, ]

Now I want to figure out what dates I am missing and what animal ID these dates belong to. I have been using up to this point the following code:

FullSeqID <- seq.Date(from=min(df$dates), to=max(df$dates), by=1)
missingID <- FullSeqID[!FullSeqID %in% df$dates]

This will tell me what days I am missing data on, however I am trying to figure out what the ID is for the animal that is missing those days (and if it is just one that I can drop or multiple animals in which I will populate information for).

NOTE because in my own dataset I have 600 animals over 20 years I have done this running the following loop, as when using just the code above it returns that I am not missing any days as there is at least one animal with a location per day out of my dataset:


y <- split(my_data, my_data$Animal_ID)

for (i in 1:length(y)) {
  AnimalIDloop <- y[[i]]
  myDataDateID <- AnimalIDloop[, c("Date")]
  FullSeqID <- seq.Date(from=min(myDataDateID), to=max(myDataDateID), by=1)
  missingID <- FullSeqID[!FullSeqID %in% myDataDateID]
}
missingID

This returns a number of dates that I am missing (specifically in one year) but again I am unsure if this is for one animal or multiple animals missing data in that year. I am sure there is a simple solution I could potentially even add to my loop, and I am just unable to think of it.

CodePudding user response:

Here are some dplyr approaches to this, which I think is a great tool for this sort of exploratory data analysis and cleaning:

library(dplyr)

How many rows are present for each id? We can see 3 & 4 have fewer:

df %>%
  count(id)

#  id    n
#1  1 1461
#2  2 1461
#3  3 1460
#4  4 1460

In case there might be duplicate rows of the same animal on the same day, we could add a de-dupe step to see how many days are present for each id:

df %>%
  distinct(dates, id) %>%  # dedupe multiple rows for one date & one animal 
  count(id)

#  id    n
#1  1 1461
#2  2 1461
#3  3 1460
#4  4 1460

We might also wonder which dates have fewer than max # of animals?

df %>%
  count(dates) %>%
  filter(n != max(n)) 

#       dates n
#1 2003-01-01 3
#2 2003-01-02 3

It sounds like you are most interested in getting a list of the missing animal-dates. There's probably a more elegant approach, but one way could be to make up some placeholder data (ie the "presence" column here), use complete from tidyr to add every possible date in the data to every possible id, and then to just keep the rows that weren't there originally, ie which don't have that placeholder. One shortfall of this approach is that it won't detect a date that is missing for all animals. The longer version below that would address that issue.

(If you only want to look at the range of data from each animal's first data to their last data, you could add a group_by(id) %>% before the complete(dates = seq.Date... step so that the dates are only filled in for that animal's range of dates.)

df %>%
  mutate(presence = "present") %>%
  complete(dates, id) %>%
  #complete(dates = seq.Date(min(dates), max(dates), by = "day"), id) %>%
  filter(is.na(presence))

## A tibble: 2 × 3
#  dates         id presence
#  <date>     <int> <chr>   
#1 2003-01-01     4 NA      
#2 2003-01-02     3 NA 

CodePudding user response:

Since you already split your dataset, you may simply run the code you were using in a function in lapply.

y <- split(my_data, my_data$Animal_ID)
lapply(y, function(x) FullSeqID[!FullSeqID %in% x$dates])
# $`1`
# Date of length 0
# 
# $`2`
# Date of length 0
# 
# $`3`
# [1] "2003-01-02"
# 
# $`4`
# [1] "2003-01-01"

Or use it in by which combines split and by in a sense.

by(my_data, my_data$Animal_ID, function(x) FullSeqID[!FullSeqID %in% x$dates])
# my_data$Animal_ID: 1
# Date of length 0
# ------------------------------------------------------------------------------------------ 
# my_data$Animal_ID: 2
# Date of length 0
# ------------------------------------------------------------------------------------------ 
# my_data$Animal_ID: 3
# [1] "2003-01-02"
# ------------------------------------------------------------------------------------------ 
# my_data$Animal_ID: 4
# [1] "2003-01-01" 
  •  Tags:  
  • r
  • Related