Deleting duplicates in data based on the conditions of a date column and a different column in r-CodePudding

In my data, I have 10 unique sample dates for each of the 15 animals we measured clinical signs for. For each date, two people took clinical signs (temperature, swelling, etc) for each animal. Taking all other parts of the data into account, each animal has four rows with the same sample date. For two of the rows, there is one initial and for the other two rows there is either a different initial OR an NA (for when that sampler was absent that day). My goal in my data is to delete the rows which, for the same date, have one set of initials in 2 of the 4 rows (per unique animal), but an NA in the other 2 rows with that same date (for that same animal).

Clarification: There are other NAs in the Initials column that I would like to leave in. For example, for animal 6, I would like to leave in all of the NAs. But for the other animals with 4 rows that have two of those rows populated with initials and the other two rows have NAs, I would like to delete the NA rows. Thank you!

Here is some sample code:

Data <- data.frame(matrix(ncol = 3, nrow = 24))
colnames(Data) <- c('AnimalID', 'DateSampled', 'Initials')

Data$AnimalID <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6)

Data$DateSampled <- as.Date(c("2021-10-13", "2021-10-13", "2021-10-13", "2021-10-13", "2021-10-27", "2021-10-27", "2021-10-27", "2021-10-27", "2021-11-10", "2021-11-10", "2021-11-10", "2021-11-10", "2021-11-24", "2021-11-24", "2021-11-24", "2021-11-24", "2021-12-01", "2021-12-01", "2021-12-01", "2021-12-01", "2021-12-05", "2021-12-05", "2021-12-05", "2021-12-05"))

Data$Initials <- c("AB", "AB", NA, NA, "AB", "AB", "CD", "CD", "AB", "AB", NA, NA, "AB", "AB", "CD", "CD", "AB", "AB", NA, NA, NA, NA, NA, NA)

Desired Output:

AnimalID  |  DateSampled  | Initials
1         |  "2021-10-13" |  AB
1         |  "2021-10-13" |  AB
2         |  "2021-10-27" |  AB
2         |  "2021-10-27" |  AB
2         |  "2021-10-27" |  CD
2         |  "2021-10-27" |  CD
3         |  "2021-11-10" |  AB
3         |  "2021-11-10" |  AB
4         |  "2021-11-24" |  AB
4         |  "2021-11-24" |  AB
4         |  "2021-11-24" |  CD
4         |  "2021-11-24" |  CD
5         |  "2021-12-01" |  AB
5         |  "2021-12-01" |  AB
6         |  "2021-12-05" |  NA
6         |  "2021-12-05" |  NA
6         |  "2021-12-05" |  NA
6         |  "2021-12-05" |  NA

Whether its for loops or a conditional vector, if there is an "AB" (or any other set of initials) AND an "NA" for the same animal id and sample date, I would love to delete the rows that have NA in them. Thank you for your help!

CodePudding user response：

Here is a way to achieve this using dplyr. filter(!is.na(Initials)) wil remove all the rows with NA. distinct() will get rid of the duplicated rows:

library(dplyr)

Data %>% 
  filter(!is.na(Initials)) %>% 
  distinct()

  EweID DateSampled Initials
1     1  2021-10-13       AB
2     2  2021-10-27       AB
3     2  2021-10-27       CD
4     3  2021-11-10       AB
5     4  2021-11-24       AB
6     4  2021-11-24       CD
7     5  2021-12-01       AB

Update

Thanks for clarifying your output, here is a way to achieve. First is to create a intermediate data frame for each animal and counting the number of NA per group:

Number_of_NA = Data %>%
  group_by(AnimalID)%>%
  summarise(n = sum(is.na(Initials)))

> Number_of_NA
# A tibble: 7 x 2
  AnimalID     n
     <dbl> <int>
1        1     2
2        2     0
3        3     2
4        4     0
5        5     2
6        6     4
7        7     4

If I understand correctly the group you want to keep with NA will always have 4 values with NA. You can use this to filter all the NA in the data frame as previously and then join the group with only the 4 NAs:

Data %>% filter(!is.na(Initials)) %>%
  full_join(filter(Data, AnimalID %in% Number_of_NA$AnimalID[Number_of_NA$n == 4]))

   AnimalID DateSampled Initials
1         1  2021-10-13       AB
2         1  2021-10-13       AB
3         2  2021-10-27       AB
4         2  2021-10-27       AB
5         2  2021-10-27       CD
6         2  2021-10-27       CD
7         3  2021-11-10       AB
8         3  2021-11-10       AB
9         4  2021-11-24       AB
10        4  2021-11-24       AB
11        4  2021-11-24       CD
12        4  2021-11-24       CD
13        5  2021-12-01       AB
14        5  2021-12-01       AB
15        6  2021-12-05     <NA>
16        6  2021-12-05     <NA>
17        6  2021-12-05     <NA>
18        6  2021-12-05     <NA>
19        7  2021-12-15     <NA>
20        7  2021-12-15     <NA>
21        7  2021-12-15     <NA>
22        7  2021-12-15     <NA>

Data

Data = structure(list(AnimalID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 
3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7), DateSampled = structure(c(18913, 
18913, 18913, 18913, 18927, 18927, 18927, 18927, 18941, 18941, 
18941, 18941, 18955, 18955, 18955, 18955, 18962, 18962, 18962, 
18962, 18966, 18966, 18966, 18966, 18976, 18976, 18976, 18976
), class = "Date"), Initials = c("AB", "AB", NA, NA, "AB", "AB", 
"CD", "CD", "AB", "AB", NA, NA, "AB", "AB", "CD", "CD", "AB", 
"AB", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 
-28L), class = "data.frame")

CodePudding user response：

I would be better if you could provide an expected output.

The logic of filtering is a bit hard to understand.

From what I can gather, if you just want to remove all NA data in the Initials column and remove duplicated rows

Data <- Data[!is.na(Data$Initials),]
Data <- Data[!duplicated(Data),]

This is what I guess that you want to achieve using tidyverse:

distinct will output only distinct row from the data
filter will remove any row with NA in the Initials field

library(tidyverse)
Data %>%
  distinct() %>%
  filter(!is.na(Initials))
# EweID DateSampled Initials
# 1     1  2021-10-13       AB
# 2     2  2021-10-27       AB
# 3     2  2021-10-27       CD
# 4     3  2021-11-10       AB
# 5     4  2021-11-24       AB
# 6     4  2021-11-24       CD
# 7     5  2021-12-01       AB

If you want to still include the NA row where the EweID do not have any other Initials beside NA. Just add another step to find the EweID-DateSampled that only have NA in their Initials column

Data %>% distinct() %>%
  group_by(EweID, DateSampled) %>%
  summarise("var"=paste(Initials, collapse='-'))
# EweID DateSampled   var
# 1     1  2021-10-13 AB-NA
# 2     2  2021-10-27 AB-CD
# 3     3  2021-11-10 AB-NA
# 4     4  2021-11-24 AB-CD
# 5     5  2021-12-01 AB-NA
# 6     6  2021-12-02    NA

filter the NA row and rbind it to the above output

Data %>% distinct() %>%
  group_by(EweID, DateSampled) %>%
  summarise("var"=paste(Initials, collapse='-')) %>%
  filter(var=="NA")
# EweID DateSampled var
# 1     6  2021-12-02  NA