Remove duplicate after grouping data in R-CodePudding

I have a table looks like this.

   Year District week_number
1  2020       11          12
2  2020        2          12
3  2020        2          12
4  2020        2          12
5  2020        7          12
6  2020        2          12
7  2020       12          12
8  2020        4          12
9  2020       10          12
10 2020        2          12
11 2020       22          12
12 2020        6          12
13 2020        3          12
14 2020        9          12
15 2020        3          12

I want to sum number of district by week_number, so i use formula below. A new total column is created as well.

crime4 = crime3 %>% 
 group_by(District) %>% 
 mutate(Total = sum(week_number, na.rm=TRUE))  %>%
 ungroup()

However, the result returns with duplicates in District.

  Year  District week_number  Total
   <chr>    <int>       <dbl>  <dbl>
 1 2020        11          12 607260
 2 2020         2          12 436255
 3 2020         2          12 436255
 4 2020         2          12 436255
 5 2020         7          12 605288
 6 2020         2          12 436255
 7 2020        12          12 483924
 8 2020         4          12 628875
 9 2020        10          12 467589
10 2020         2          12 436255

Any better method i can get the Total by not having duplicates in District? It should be a simple solution but i just can't crack it

CodePudding user response：

If you want the number of distinct district by week (currently you have kind of the opposite!) you can do:

library(tidyverse)
crime3 %>%
  group_by(week_number) %>%
  summarize(n_districts = length(unique(District)))

CodePudding user response：

Do either of these approaches give you your desired outcome?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

crime3 <- read.table(text = "   Year District week_number
1  2020       11          12
2  2020        2          12
3  2020        2          12
4  2020        2          12
5  2020        7          12
6  2020        2          12
7  2020       12          12
8  2020        4          12
9  2020       10          12
10 2020        2          12
11 2020       22          12
12 2020        6          12
13 2020        3          12
14 2020        9          12
15 2020        3          12",
header = TRUE)

# One potential option:
crime3 %>%
  group_by(District) %>% 
  summarise(Total = sum(week_number, na.rm=TRUE))
#> # A tibble: 10 × 2
#>    District Total
#>       <int> <int>
#>  1        2    60
#>  2        3    24
#>  3        4    12
#>  4        6    12
#>  5        7    12
#>  6        9    12
#>  7       10    12
#>  8       11    12
#>  9       12    12
#> 10       22    12

# Or perhaps:
crime3 %>% 
  group_by(District) %>% 
  mutate(Total = sum(week_number, na.rm=TRUE))  %>%
  ungroup() %>%
  distinct()
#> # A tibble: 10 × 4
#>     Year District week_number Total
#>    <int>    <int>       <int> <int>
#>  1  2020       11          12    12
#>  2  2020        2          12    60
#>  3  2020        7          12    12
#>  4  2020       12          12    12
#>  5  2020        4          12    12
#>  6  2020       10          12    12
#>  7  2020       22          12    12
#>  8  2020        6          12    12
#>  9  2020        3          12    24
#> 10  2020        9          12    12

^{Created on 2022-08-25 by the reprex package (v2.0.1)}