Delete/overwrite rows by partial matching

I need to check whether rows are partially duplicated and delete or overwrite those where 2 columns match a different row in which all 3 values are present. One problem is that the "real" data frame contains a couple of list columns, which makes some operations unfeasible. Ideally, any row for which a match can be found would be checked independently of the number of columns, meaning that only the row with the most non-NA columns (out of all rows that share the matching column values) is kept.

   o1 o2 o3
1   1 NA NA
2   2 NA NA
3   3 NA NA
4   4 NA NA
5   6 NA NA
6   7 NA NA
7   5  9 NA # this row has only 2 values which match values from row 11 but the last value is na
8  10 NA NA
9  12 NA NA
10 13 NA NA
11  5  9 14 # this row has values in all 3 columns 
12 14 NA NA
13  8 11 15 # so does this row
14 16 NA NA
15 17 NA NA
16 18 NA NA
17 19 NA NA
18 20 NA NA

The result should be the same data frame, just without row 7 (or with row 7 overwritten by row 11).

This should be easy to do, but for some reason I didn't manage it (except with a convoluted for loop that is hard to generalize should more columns be added later). Is there a straightforward way to do this?

dput of above df:

structure(list(o1 = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 12L, 13L, 
5L, 14L, 8L, 16L, 17L, 18L, 19L, 20L), o2 = c(NA, NA, NA, NA, 
NA, NA, 9L, NA, NA, NA, 9L, NA, 11L, NA, NA, NA, NA, NA), o3 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 14L, NA, 15L, NA, NA, NA, 
NA, NA)), row.names = c(NA, -18L), class = "data.frame")

If there is already an answer for something like this, please let me know.

CodePudding user response：

I thought of using dplyr:

library(dplyr)

df %>% 
  mutate(rn = row_number(),
         count_na = rowSums(across(o1:o3, is.na))) %>% 
  group_by(o1, o2) %>% 
  slice_min(count_na) %>% 
  arrange(rn) %>% 
  ungroup() %>% 
  select(o1:o3)

This returns

# A tibble: 17 x 3
      o1    o2    o3
   <int> <int> <int>
 1     1    NA    NA
 2     2    NA    NA
 3     3    NA    NA
 4     4    NA    NA
 5     6    NA    NA
 6     7    NA    NA
 7    10    NA    NA
 8    12    NA    NA
 9    13    NA    NA
10     5     9    14
11    14    NA    NA
12     8    11    15
13    16    NA    NA
14    17    NA    NA
15    18    NA    NA
16    19    NA    NA
17    20    NA    NA

This solution is based on the following ideas:

For every row, we count the number of NAs in that row.
We group by o1 and o2 to create groups of rows that belong together. Here is a possible flaw: it may be better to group by o1 only, or to use some other grouping. This depends on the structure of your data: should 1, <NA>, <NA> be overwritten by 1, 2, <NA>?
After grouping, we keep the row with the smallest number of NAs in each group.
Finally, we do some cleanup: restoring the original row order, ungrouping, and removing the auxiliary columns.
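
To make the grouping question concrete, here is a minimal sketch on a hypothetical four-row data frame (toy and keep_fullest are names invented for this illustration, not part of the answer above). It assumes that ties inside a group should resolve to a single row, hence with_ties = FALSE.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 2 share o1 only,
# rows 3 and 4 share both o1 and o2.
toy <- data.frame(
  o1 = c(1L, 1L, 5L, 5L),
  o2 = c(NA, 2L, 9L, 9L),
  o3 = c(NA, NA, NA, 14L)
)

# Keep the row with the fewest NAs in each group defined by `...`.
keep_fullest <- function(data, ...) {
  data %>%
    mutate(rn = row_number(),
           count_na = rowSums(across(o1:o3, is.na))) %>%
    group_by(...) %>%
    slice_min(count_na, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    arrange(rn) %>%
    select(o1:o3)
}

by_o1_o2 <- keep_fullest(toy, o1, o2)  # (1, NA, NA) forms its own group and survives
by_o1    <- keep_fullest(toy, o1)      # (1, NA, NA) loses to (1, 2, NA)
```

Grouping by o1 and o2 leaves three rows, grouping by o1 alone leaves two. Which one is right depends on whether a row that shares only o1 should count as a match.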

CodePudding user response：

Here is a partial solution that detects the duplicates. It still remains to specify which rows to delete; I ran out of time. I went ahead and "duplicated" a couple more rows.

df=read.table(text="
   o1 o2 o3
1   1 NA NA
2   2 NA NA
3   3 NA NA
4   4 NA NA
5   6 NA NA
6   7 NA NA
7   5  9 NA
8  10 NA NA
9  12 NA NA
10 13 NA NA
11  5  9 14
12 14 NA NA
13  8 11 15
14 16 NA NA
15 7 1 2
16 18 NA NA
17 7 1 3
18 20 NA NA", header = TRUE)

The main trick is to calculate a distance matrix and check which pairs of rows have a distance of zero. dist computes each pairwise distance using only the columns in which both rows are non-missing, so a distance of zero means the two rows agree on every column they share.

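tmp <- as.matrix(dist(df))   # pairwise distances; NAs are dropped pair by pair
diag(tmp) <- NA              # ignore each row's distance to itself
tmp[lower.tri(tmp)] <- NA    # keep every pair only once (upper triangle)

# row/column indices of all pairs with distance zero
tod <- data.frame(which(tmp == 0, arr.ind = TRUE))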

resulting in

     row col
X7     7  11
X6     6  15
X6.1   6  17

CodePudding user response：

Here's another way that considers all columns. It should work with any number of columns, regardless of their names or positions.

library(dplyr)
mydf <- structure(list(o1 = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 12L, 13L, 
                              5L, 14L, 8L, 16L, 17L, 18L, 19L, 20L), 
                       o2 = c(NA, NA, NA, NA, 
                              NA, NA, 9L, NA, NA, NA, 9L, NA, 11L, NA, NA, NA, NA, NA), 
                       o3 = c(NA, 
                              NA, NA, NA, NA, NA, NA, NA, NA, NA, 14L, NA, 15L, NA, NA, NA, 
                              NA, NA)), 
                  row.names = c(NA, -18L), 
                  class = "data.frame")

columns <- names(mydf)
dummy_cols <- paste0(columns, "_dummy")
mydf %>% 
  # duplicate the dataframe
  cbind(mydf %>% `names<-`(dummy_cols)) %>% 
  # arrange across all columns (NAs sort last, so within equal values the fuller row comes first)
  arrange(across(all_of(columns))) %>% 
  # fill NAs downwards
  tidyr::fill(all_of(dummy_cols), .direction = "down") %>% 
  # create a dummy ID; a separator avoids collisions such as 1,12 vs 11,2
  tidyr::unite(id_dummy, all_of(dummy_cols), sep = "_") %>% 
  # group by the id 
  group_by(id_dummy) %>% 
  # get the first row of each
  filter(row_number() == 1) %>% 
  ungroup() %>% 
  select(all_of(columns))

P.S. This also replaces 1 - NA - NA with 1 - 2 - NA, and 1 - NA - NA with 1 - NA - 3.
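
For comparison, the "keep only the fullest matching row" rule from the question can also be written directly in base R. This is only a sketch, not any answerer's method: is_covered is a hypothetical helper, it compares every pair of rows (so it is quadratic in the number of rows), and it assumes that only the atomic key columns are passed in, with any list columns left out of the comparison.

```r
mydf <- data.frame(
  o1 = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 12L, 13L,
         5L, 14L, 8L, 16L, 17L, 18L, 19L, 20L),
  o2 = c(NA, NA, NA, NA, NA, NA, 9L, NA, NA, NA,
         9L, NA, 11L, NA, NA, NA, NA, NA),
  o3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
         14L, NA, 15L, NA, NA, NA, NA, NA)
)

# TRUE for rows that are "covered" by another row: the other row has the
# same value in every column where this row is non-NA, and it has strictly
# more non-NA values overall.
is_covered <- function(data) {
  m <- as.matrix(data)
  n_filled <- rowSums(!is.na(m))
  vapply(seq_len(nrow(m)), function(i) {
    filled <- !is.na(m[i, ])
    any(vapply(seq_len(nrow(m)), function(j) {
      j != i &&
        n_filled[j] > n_filled[i] &&
        !anyNA(m[j, filled]) &&
        all(m[j, filled] == m[i, filled])
    }, logical(1)))
  }, logical(1))
}

covered <- is_covered(mydf[c("o1", "o2", "o3")])
cleaned <- mydf[!covered, ]
```

On the question's data only row 7 is covered (by row 11), so cleaned has 17 rows in their original order. Under this rule 1 - NA - 3 and 1 - 2 - NA would both be kept, because neither contains all values of the other.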