tidyr::complete behaves oddly depending on column class and rounding specification-CodePudding

I have a data frame of depths for a number of different stations in a lake, and I want every station to have a complete sequence of depths from the min to the max (and missing values filled with NAs).

I am using tidyr::complete to do this, but it is behaving oddly. When my depths are rounded to zero decimal places, the code runs as expected, but when the data are to the tenth of a meter, something very odd happens to the class of depth and some combinations are completed (and values filled with NA) even though I already have data for that depth. An extra step is then required to remove those NAs.

Has anyone experienced this before? I assume it has something to do with the class of depth, but I haven't quite figured it out or how to avoid it.

library(dplyr)

b <- data.frame(site = c(rep("A", 10), rep("B", 10)),
                depth = c(seq(0.1, 0.8, 0.1), 1.0, 1.1, seq(0.3, 0.5, 0.1), seq(0.9, 1.5, 0.1)),
                value = round(runif(20, 0, 5), 1))

str(b)

b2 <- b %>% 
  mutate(site = factor(site)) %>% 
  group_by(site) %>% 
  tidyr::complete(depth = seq(min(depth),
                              max(depth),
                              by = 0.1)) %>% 
  arrange(site, depth)

class(b2$depth) # R thinks the class is numeric but...
unique(b2$depth) # values are not in order and some NA values have been added where I already had data (e.g., site B, depth 9)

# When I convert out of numeric and back to numeric, the class seems to have changed back to what I would expect, although I still need to remove the NAs I don't want.
class(as.numeric(as.character(b2$depth))) 
unique(as.numeric(as.character(b2$depth)))

# If depths have no decimal places, the behaviour seems more predictable, with no repeated depths, and the class of depth behaving as a numeric.

a <- data.frame(site = c(rep("A", 10), rep("B", 10)),
                depth = c(1:4, 6:11, 3:5, 8, 10, 12:16),
                value = round(runif(20, 0, 5), 1))

str(a)

a2 <- a %>% 
  mutate(site = factor(site)) %>% 
  group_by(site) %>% 
  tidyr::complete(depth = seq(min(depth),
                              max(depth),
                              by = 1)) %>% 
  arrange(site, depth)

class(a2$depth)
unique(a2$depth) # This is what I expect; no duplicates, printed in order, etc.

CodePudding user response：

Indeed, complete() is producing extra unexplained elements.

I think, when complete() matches the numbers produced by the sequence and the numbers in the original data, it evaluates (1.4 == 1.4) as FALSE and returns them as two different numbers. I did a small test and I indeed got FALSE too! To avoid this we need to round and make sure the "depth" values in the original dataframe and those created by seq() are the same.

b <- data.frame(site = c(rep("A", 10), rep("B", 10)),
                depth = c(seq(0.1, 0.8, 0.1), 1.0, 1.1, seq(0.3, 0.5, 0.1), seq(0.9, 1.5, 0.1)),
                value = round(runif(20, 0, 5), 1))

b$depth <- round(b$depth,2) #crucial to do it here and not via mutate()

b2 <- b %>%
  mutate(site = factor(site)) %>%
  group_by(site) %>% 
  tidyr::complete(depth = round(seq(min(depth),
                                max(depth),
                                by = 0.1), 
                                2)) %>% 
  ungroup()

output of site B to check all is good.

> b2[b2$site == "B",]
# A tibble: 13 × 3
   site  depth value
   <chr> <chr> <dbl>
 1 B     0.3     3.4
 2 B     0.4     0.7
 3 B     0.5     4.5
 4 B     0.6    NA  
 5 B     0.7    NA  
 6 B     0.8    NA  
 7 B     0.9     1.4
 8 B     1       1.2
 9 B     1.1     1.3
10 B     1.2     3.3
11 B     1.3     2.1
12 B     1.4     0.9
13 B     1.5     1.2

The problem, as I understand it, is due to the way seq() stores the values it produces. Notice how the values are not exact:

> v2 = seq(.3,1.4, by = .1)
> print(v2, digits =16)
 [1] 0.3000000000000000 0.4000000000000000 0.5000000000000000 0.6000000000000001
 [5] 0.7000000000000000 0.8000000000000000 0.9000000000000001 1.0000000000000000
 [9] 1.1000000000000001 1.2000000000000000 1.3000000000000000 1.3999999999999999

Since 1.3999... is not equals to 1.4, complete() did not join the two 1.4 values. That's how the duplicates got produced.