How to check for duplicates for multiple time periods and different locations

I need some help creating a loop function for my dataset. I have data on measured substances (A, B, C, D, ...) at different locations (ID 1, 2, 3, 4, ...) during different time periods.

The data looks like this:

Location_ID	Substance	date
1	A	16.02.2021
2	A	18.02.2021
1	A	17.02.2021
2	B	18.02.2021
1	B	19.02.2021
2	A	18.02.2021
1	C	17.02.2021
2	C	18.02.2021

The goal is to check, for each date and each location ID, whether the same substance was measured more than once. As you can see, there are two rows for Substance A on 18.02.2021 at location 2. If the loop doesn't find a duplicate, I want something like print("No duplicate found"); otherwise print("Duplicate found") and print a list of every row containing a duplicate.

I'm new to programming, so I would also appreciate an explanation of the code you come up with :-)

Thank you very much!!!

CodePudding user response:

Let's create some dummy data first:

data<-data.frame(
  Location_ID=c(1,2,1,2,1,2,1,2),
  Substance=c("A","A","A","B","B","A","C","C"),
  date=c("16.02.2021","18.02.2021","17.02.2021","18.02.2021",
         "19.02.2021","18.02.2021","17.02.2021","18.02.2021"))

The main strategy is to compare each row with all the others to find which rows are exactly the same. For example,

data[2,]==data[6,]
  Location_ID Substance date
2        TRUE      TRUE TRUE

Basically, if the comparison gives 3 TRUEs, one for each column, then the row is a duplicate. If we count the TRUEs for each comparison, we will know which rows are duplicates:

length(grep("TRUE",data[2,]==data[6,]))
[1] 3
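
Counting the TRUEs does not strictly need grep(). As a small sketch on the same dummy data (the names cmp, n_true and is_dup are chosen here only for illustration), sum() counts the TRUEs directly because R treats TRUE as 1, and all() answers the duplicate question in one step:

```r
# Same dummy data as in the answer
data <- data.frame(
  Location_ID = c(1, 2, 1, 2, 1, 2, 1, 2),
  Substance   = c("A", "A", "A", "B", "B", "A", "C", "C"),
  date        = c("16.02.2021", "18.02.2021", "17.02.2021", "18.02.2021",
                  "19.02.2021", "18.02.2021", "17.02.2021", "18.02.2021")
)

cmp    <- data[2, ] == data[6, ]  # logical matrix: one row, three columns
n_true <- sum(cmp)                # TRUE counts as 1, FALSE as 0
is_dup <- all(cmp)                # TRUE only if every column matches

print(n_true)
print(is_dup)
```

Either form could replace length(grep("TRUE", ...)) in the scripts; the grep() version is kept in what follows so the explanation stays consistent.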

Also, we have to ensure that a row doesn't get compared to itself. The scripts are given below, each with a description directly above it.

#Create a list of data frames, each with one row of data removed (data_rem[[i]] is data without row i)

data_rem<-list()
for (i in 1:nrow(data)){
  data_rem[[i]]<-data[-i,]
}

#Match each row of data with all rows of the corresponding data frame in data_rem and count the number of TRUEs.
#For example: length(grep("TRUE",data[1,]==data_rem[[1]][1,])), length(grep("TRUE",data[2,]==data_rem[[2]][1,])) etc.

row_match<-list()
for (i in 1:nrow(data)){
  jnk<-numeric()
  #data_rem[[i]] has one row fewer than data, so loop over its own rows
  for (j in 1:nrow(data_rem[[i]])){
    jnk[j]<-length(grep("TRUE",data[i,]==data_rem[[i]][j,]))
  }
  row_match[[i]]<-jnk
}

#Count, for each row, how many other rows match it in all 3 columns: 0 = no duplicate, 1 or more = duplicate

dup_pos<-numeric()
for (i in 1:length(row_match)){
  dup_pos[i]<-length(which(row_match[[i]]==3))
}

#Print a message depending on whether duplicates were found; if so, store the duplicated rows in dup_list and print them

if (length(which(dup_pos>=1))==0){
  print("No duplicate found")
} else {
  print("Duplicate found")
  dup_list<-data[which(dup_pos>=1),]
  print(dup_list)
}

Although I have explained the broad strategy of these scripts, I would recommend that you examine the functions used in each script to understand them better.
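
For comparison, base R already provides duplicated() for this kind of check. The following is a shorter sketch, not the loop approach described above, and it assumes the data frame holds only the three columns shown in the question; is_dup is a name chosen here for illustration:

```r
# Same dummy data as in the answer
data <- data.frame(
  Location_ID = c(1, 2, 1, 2, 1, 2, 1, 2),
  Substance   = c("A", "A", "A", "B", "B", "A", "C", "C"),
  date        = c("16.02.2021", "18.02.2021", "17.02.2021", "18.02.2021",
                  "19.02.2021", "18.02.2021", "17.02.2021", "18.02.2021")
)

# duplicated() flags only the second and later copies of a row;
# combining it with fromLast = TRUE also flags the first copy
is_dup <- duplicated(data) | duplicated(data, fromLast = TRUE)

if (!any(is_dup)) {
  print("No duplicate found")
} else {
  print("Duplicate found")
  dup_list <- data[is_dup, ]   # every row that has an identical copy
  print(dup_list)
}
```

If the real data has further columns (measured values, for example), one option is to pass only the key columns, e.g. duplicated(data[, c("Location_ID", "Substance", "date")]), so that rows count as duplicates when location, substance and date agree.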