Home > Software engineering >  Group column values by a set numeric difference in R (big dataset)
Group column values by a set numeric difference in R (big dataset)

Time:09-18

I want a script to put column values in groups with a maximum range of 10. So that all values in the group are <10 different from each other. I'm trying to get a script that defines the groups, counts how many values are in each group, and find the mean.

So if I had this df:

cat1 = c(85, 60, 60, 55, 55, 15, 0, 35, 35 )
cat2 = c("a","a","a","a","a","a","a","a","a")
df <- data.frame(cat1, cat2)

  cat1 cat2
1   85    a
2   60    a
3   60    a
4   55    a
5   55    a
6   15    a
7    0    a
8   35    a 
9   35    a

the output would be:

numValues cat1avg 
      1    85
      4    57.5
      1    15
      1     0
      2    35

I followed the top-rated answer enter image description here

We could see how much range is in each cluster produced in each case, and find the widest range cluster for each k. We see that dividing into four groups would have at least 15 range (see the 4 facet in the chart above), but 5 groups would be adequate to keep the within-cluster range under 5.

assignments %>%
  group_by(k, .cluster) %>%
  summarize(range = max(cat1) - min(cat1)) %>%
  summarize(max_range = max(range))

# A tibble: 6 × 2
      k max_range
  <int>     <dbl>
1     1        85
2     2        35
3     3        30
4     4        15
5     5         5
6     6         0

And finally:

assignments %>%
  filter(k == 5) %>%
  group_by(.cluster) %>%
  summarize(numValues = n(),
            cat1avg = mean(cat1))


  .cluster numValues cat1avg
  <fct>        <int>   <dbl>
1 1                1    85  
2 2                1    15  
3 3                2    35  
4 4                4    57.5
5 5                1     0  
  • Related