R duplicated rows still remain after distinct

Time:06-06

I am trying to remove duplicated rows from my data frame, but neither distinct(d) nor filter(d, !duplicated(d)) removes them (d is the data frame containing the duplicated rows); the functions do not seem to recognize the rows as duplicates. Is there a common reason why this happens?

Below is the example dataset using dput.

structure(list(id.case = c("114746", "114746", "114746", "114746", 
"114746", "114746", "114746", "114746", "114746", "114746", "114746", 
"114746", "114746", "114746", "114746", "114746", "114746", "114746", 
"114746", "114746"), id.pair = c("78272-10794", "9330-10794", 
"9330-10794", "80739-42071", "80739-42071", "42114-10794", "42114-10794", 
"84701-42114", "84701-42114", "5533-42071", "5533-42071", "8876-5533", 
"8876-5533", "5652-42114", "5652-42114", "8920-5652", "8920-5652", 
"78272-5533", "78272-5533", "9114-78272"), e1.conditional.dyad = c(1.07224025692901, 
0.568380969299369, 0.568380969302098, 0.252545406662165, 0.252545406663273, 
-1.21808723071715, -1.21808723071797, -4.1477891182987, -4.14778911829956, 
-1.48315629665277, -1.48315629665359, -1.3047217588809, -1.30472175888309, 
-1.63547814316539, -1.63547814316453, -0.671008645771849, -0.671008645772957, 
-0.0801843233972761, -0.0801843233964519, 2.30874742062369)), row.names = c(NA, 
20L), class = "data.frame")

This is the code I am using:

d %>% distinct

CodePudding user response:

Up front: your numbers are not exactly the same, see

d[2:3,]
#   id.case    id.pair e1.conditional.dyad
# 2  114746 9330-10794            0.568381
# 3  114746 9330-10794            0.568381
diff(d[2:3,3])
# [1] 2.729039e-12

Computers have limitations when it comes to floating-point numbers (aka double, numeric, float). This is a fundamental consequence of storing non-integer numbers in a finite number of bits, and it is not specific to any one programming language. There are add-on libraries and packages that handle arbitrary-precision math much better, but I believe most mainstream languages (this is relative/subjective, I admit) do not use them by default. In your data, values that print identically differ around the 12th decimal place, so exact comparisons such as distinct() and duplicated() treat those rows as different. Refs: Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754
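To make the point concrete, here is a minimal base R sketch using the two values from rows 2 and 3 of the question's data. The names x and y are only for illustration.

```r
# Rows 2 and 3 of e1.conditional.dyad, copied from the dput output
x <- 0.568380969299369
y <- 0.568380969302098

print(c(x, y))               # both display as 0.568381 at the default 7 significant digits
x == y                       # FALSE: exact comparison sees the tiny difference
isTRUE(all.equal(x, y))      # TRUE: equal within all.equal()'s default tolerance
print(c(x, y), digits = 15)  # more digits reveal where the values diverge
```

distinct() and duplicated() compare values exactly, like ==, which is why the rows survive.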

To continue using distinct without losing the actual precision of your values, try

d %>%
  distinct(id.case, id.pair, ign = round(e1.conditional.dyad, 8), .keep_all = TRUE) %>%
  select(-ign)
#    id.case     id.pair e1.conditional.dyad
# 1   114746 78272-10794          1.07224026
# 2   114746  9330-10794          0.56838097
# 3   114746 80739-42071          0.25254541
# 4   114746 42114-10794         -1.21808723
# 5   114746 84701-42114         -4.14778912
# 6   114746  5533-42071         -1.48315630
# 7   114746   8876-5533         -1.30472176
# 8   114746  5652-42114         -1.63547814
# 9   114746   8920-5652         -0.67100865
# 10  114746  78272-5533         -0.08018432
# 11  114746  9114-78272          2.30874742

The choice of 8 digits is arbitrary here; pick it based on what you know about the precision of your data.
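One way to inform that choice is to look at how far apart the values are within rows that share the same key. The following is a rough base R sketch, not a definitive recipe; d_small is a hypothetical stand-in holding four rows copied from the question, and gap is an illustrative name.

```r
# Hypothetical stand-in for `d`: four rows copied from the question's data
d_small <- data.frame(
  id.pair = c("9330-10794", "9330-10794", "80739-42071", "80739-42071"),
  e1.conditional.dyad = c(0.568380969299369, 0.568380969302098,
                          0.252545406662165, 0.252545406663273)
)

# For each id.pair, the distance between its smallest and largest value
gap <- tapply(d_small$e1.conditional.dyad, d_small$id.pair,
              function(v) diff(range(v)))

max(gap)  # the size of the noise; the rounding step should be much larger than this
```

Here the gaps are on the order of 1e-12, so rounding to 8 decimal places leaves a wide margin. Keep in mind that rounding bins the values, so two nearly equal numbers that fall on opposite sides of a rounding boundary can still end up different.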

CodePudding user response:

The problem is that the values in your numeric column differ in their trailing digits, so the rows are not exact duplicates. If you round that column, you can remove the duplicates like this (note that this overwrites the column with the rounded values):

d$e1.conditional.dyad <- round(d$e1.conditional.dyad, digits = 4)
d %>% distinct()

Output:

   id.case     id.pair e1.conditional.dyad
1   114746 78272-10794              1.0722
2   114746  9330-10794              0.5684
3   114746 80739-42071              0.2525
4   114746 42114-10794             -1.2181
5   114746 84701-42114             -4.1478
6   114746  5533-42071             -1.4832
7   114746   8876-5533             -1.3047
8   114746  5652-42114             -1.6355
9   114746   8920-5652             -0.6710
10  114746  78272-5533             -0.0802
11  114746  9114-78272              2.3087
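If the stored values should stay untouched, the same rounding idea can be applied to a temporary key instead of the column itself. This is a base R sketch under the same arbitrary choice of 4 digits; d_small, key, and deduped are illustrative names, and d_small holds three rows copied from the question.

```r
# Hypothetical stand-in for `d`: three rows copied from the question's data
d_small <- data.frame(
  id.case = "114746",
  id.pair = c("78272-10794", "9330-10794", "9330-10794"),
  e1.conditional.dyad = c(1.07224025692901, 0.568380969299369, 0.568380969302098)
)

# Build a comparison key with the rounded value; d_small itself is not modified
key <- data.frame(d_small[c("id.case", "id.pair")],
                  rounded = round(d_small$e1.conditional.dyad, 4))

# Keep the first row of each group of matching keys
deduped <- d_small[!duplicated(key), ]
deduped
```

The rows that remain keep their original, unrounded e1.conditional.dyad values.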

CodePudding user response:

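Here's one approach (but I'm sure there will be better ones). The trick is to first collapse each row of the data frame into a single diagnostic helper column, and then use the duplicated function on that column. Note that this works because apply() converts the data frame to a character matrix, which formats the numeric column to about 7 significant digits, so near-equal values collapse to the same string; in effect it is implicit rounding: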

d %>%
  mutate(diagnost = apply(d, 1, paste0, collapse = "")) %>%
  filter(!duplicated(diagnost)) %>%
  select(-diagnost)
   id.case     id.pair e1.conditional.dyad
1   114746 78272-10794          1.07224026
2   114746  9330-10794          0.56838097
3   114746 80739-42071          0.25254541
4   114746 42114-10794         -1.21808723
5   114746 84701-42114         -4.14778912
6   114746  5533-42071         -1.48315630
7   114746   8876-5533         -1.30472176
8   114746  5652-42114         -1.63547814
9   114746   8920-5652         -0.67100865
10  114746  78272-5533         -0.08018432
11  114746  9114-78272          2.30874742
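To see the string conversion behind the helper column, the sketch below calls as.matrix() directly, which is what apply() does to a data frame before iterating over rows. d_small is a hypothetical two-row stand-in copied from rows 2 and 3 of the question.

```r
# Hypothetical stand-in for `d`: rows 2 and 3 from the question's data
d_small <- data.frame(
  id.pair = c("9330-10794", "9330-10794"),
  e1.conditional.dyad = c(0.568380969299369, 0.568380969302098),
  stringsAsFactors = FALSE
)

# A data frame with a character column becomes a character matrix;
# the numeric column is passed through format(), which uses getOption("digits")
m <- as.matrix(d_small)
m[, "e1.conditional.dyad"]                      # the two entries are identical strings

d_small$e1.conditional.dyad[1] == d_small$e1.conditional.dyad[2]  # FALSE for the raw numbers
```

Because the result depends on the digits option, making the rounding explicit with round() is usually the more predictable choice.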