Subset distinct rows based on column value closest to 0

Time:11-24

I was wondering if someone had an efficient way to remove duplicate rows from a dataframe based on a column value. Among rows that share the same grouping columns, I want to keep the row in which the value of this column is closest to 0.

For example, I have this dataframe:

df = data.frame(cond1=c("a","a","a","b","b"),cond2=c(1,1,2,3,3),value=c(10,-20,5,-5,12))
df
cond1 cond2 value
a     1     10
a     1     -20
a     2     5
b     3     -5
b     3     12

What I would like is to remove, among rows sharing the same cond1 and cond2, those whose value is farthest from 0:

cond1 cond2 value
a     1     10
a     2     5
b     3     -5

In cases where two rows in a group have opposite values (value == -value), I would ideally keep both rows in my dataframe. Do you have any suggestions for solving this? I was thinking about combining group_by, arrange, and filter from dplyr, but I am struggling to make it work with my condition. Thanks.

CodePudding user response:

A variation on the other answers which works for your example data. Note that another way of saying "the closest from 0" (I assume you mean the closest to 0) is simply the minimum of the absolute values, so within each group we keep the rows whose absolute value equals the group minimum.

library(dplyr)
df %>% 
  group_by(cond1, cond2) %>% 
  filter(abs(value) == min(abs(value))) %>%
  ungroup()

Result:

# A tibble: 3 × 3
  cond1 cond2 value
  <chr> <dbl> <dbl>
1 a         1    10
2 a         2     5
3 b         3    -5

And if we alter df so that the value in the second (a, 1) row is -10, both tied rows are kept:

# A tibble: 4 × 3
  cond1 cond2 value
  <chr> <dbl> <dbl>
1 a         1    10
2 a         1   -10
3 a         2     5
4 b         3    -5
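
If you ever need to choose between keeping ties and keeping exactly one row per group, a possible alternative is dplyr's slice_min(), which has a with_ties argument. This is only a sketch, assuming dplyr 1.0.0 or later; the names keep_ties and one_each are made up for illustration.

```r
library(dplyr)

# Example data with a tie: (a, 1) has both 10 and -10
df <- data.frame(
  cond1 = c("a", "a", "a", "b", "b"),
  cond2 = c(1, 1, 2, 3, 3),
  value = c(10, -10, 5, -5, 12)
)

# Keep every row tied for the smallest |value| in each group
keep_ties <- df %>%
  group_by(cond1, cond2) %>%
  slice_min(abs(value), n = 1, with_ties = TRUE) %>%
  ungroup()

# Keep exactly one row per group, even when there is a tie
one_each <- df %>%
  group_by(cond1, cond2) %>%
  slice_min(abs(value), n = 1, with_ties = FALSE) %>%
  ungroup()
```

With with_ties = TRUE this should select the same rows as the filter() version above; with with_ties = FALSE only one of the tied rows survives.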

CodePudding user response:

Using dplyr, we have :

library(dplyr)

df %>%
  mutate(dist = abs(0 - value)) %>%   # distance from 0, i.e. abs(value)
  group_by(cond1, cond2) %>%
  filter(dist == min(dist)) %>%       # keep the closest row(s) in each group
  select(-dist)

Output:

# A tibble: 3 x 3
# Groups:   cond1, cond2 [3]
  cond1 cond2 value
  <chr> <dbl> <dbl>
1 a         1    10
2 a         2     5
3 b         3    -5

For the value == -value case, this works as well:

Data:

df <- structure(list(cond1 = c("a", "a", "a", "b", "b"), cond2 = c(1, 
1, 2, 3, 3), value = c(10, -10, 5, -5, 12)), row.names = c(NA, 
-5L), class = "data.frame")

  cond1 cond2 value
1     a     1    10
2     a     1   -10
3     a     2     5
4     b     3    -5
5     b     3    12

Code:

df %>%
  mutate(dist = abs(0 - value)) %>%
  group_by(cond1, cond2) %>%
  filter(dist == min(dist)) %>%
  select(-dist)

Output:

  cond1 cond2 value
  <chr> <dbl> <dbl>
1 a         1    10
2 a         1   -10
3 a         2     5
4 b         3    -5
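
If you would rather not depend on dplyr, the same idea (keep rows whose absolute value equals the group minimum) can probably be written in base R with ave(). This is a rough sketch rather than a tested general solution; group_min and result are illustrative names.

```r
# Example data with a tie: (a, 1) has both 10 and -10
df <- data.frame(
  cond1 = c("a", "a", "a", "b", "b"),
  cond2 = c(1, 1, 2, 3, 3),
  value = c(10, -10, 5, -5, 12)
)

# One grouping factor for the (cond1, cond2) pairs; drop = TRUE discards
# combinations that never occur, so min() is never called on an empty group
grp <- interaction(df$cond1, df$cond2, drop = TRUE)

# For every row, the smallest |value| found in that row's group
group_min <- ave(abs(df$value), grp, FUN = min)

# Keep the rows that reach the group minimum (ties are kept)
result <- df[abs(df$value) == group_min, ]
```

Here result keeps rows 1 to 4 of df and drops the (b, 3, 12) row, matching the dplyr output above.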