How to find storm names that were reused in different years in R

I'm using the "storms" tibble from the dplyr package in R, and I want to know:

Are there storms that occurred in different years but were given the same name?
If so, which storm names were reused, and in which years?

For example:

name            year
--------      ----------- 
 Alberto         1997
 Alberto         2001
 Gordon          1993
 Felix           2000

So Alberto appears in different years under the same name.

CodePudding user response：

This code returns every name/year pair for storm names that were used in more than one year, rather than only the names with a count of how many times each was used.

library(dplyr)

storms %>% 
  distinct(name, year) %>%   # one row per name/year pair
  group_by(name) %>% 
  filter(n() > 1) %>%        # keep names that appear in more than one year
  arrange(name, year)

--- Output

# Groups:   name [106]
   name     year
   <chr>   <dbl>
 1 Alberto  1982
 2 Alberto  1988
 3 Alberto  1994
 4 Alberto  2000
 5 Alberto  2006
 6 Alberto  2012
 7 Alberto  2018
 8 Alex     1998
 9 Alex     2004
10 Alex     2010

To get just the vector of reused names:


storms %>% 
  distinct(name, year) %>% 
  group_by(name) %>% 
  filter(n() > 1) %>% 
  pull(name) %>% 
  unique() %>% 
  sort()

--- Output

  [1] "Alberto"   "Alex"      "Allison"   "Ana"       "Andrew"    "Arthur"    "Barry"     "Beryl"     "Beta"      "Bill"      "Bob"      
 [12] "Bonnie"    "Cesar"     "Chantal"   "Charley"   "Chris"     "Claudette" "Colin"     "Cristobal" "Danielle"  "Danny"     "Dean"     
 [23] "Debby"     "Diana"     "Don"       "Dorian"    "Edouard"   "Eight"     "Emily"     "Epsilon"   "Erika"     "Erin"      "Ernesto"  
 [34] "Fabian"    "Fay"       "Felix"     "Fernand"   "Fifteen"   "Fiona"     "Floyd"     "Franklin"  "Fred"      "Gabrielle" "Gamma"    
 [45] "Gaston"    "Georges"   "Gert"      "Gloria"    "Gonzalo"   "Gordon"    "Gustav"    "Hanna"     "Harvey"    "Henri"     "Hermine"  
 [56] "Hortense"  "Humberto"  "Ida"       "Ingrid"    "Iris"      "Isaac"     "Isabel"    "Isidore"   "Ivan"      "Jeanne"    "Jerry"    
 [67] "Josephine" "Joyce"     "Juan"      "Julia"     "Karen"     "Karl"      "Kate"      "Katia"     "Katrina"   "Keith"     "Kirk"     
 [78] "Klaus"     "Kyle"      "Lee"       "Leslie"    "Lili"      "Lisa"      "Lorenzo"   "Marco"     "Maria"     "Matthew"   "Melissa"  
 [89] "Michael"   "Nadine"    "Nana"      "Nate"      "Nicole"    "Noel"      "Olga"      "Omar"      "Ophelia"   "Oscar"     "Otto"     
[100] "Pablo"     "Philippe"  "Rina"      "Sebastien" "Ten"       "Two"       "Zeta"

CodePudding user response：

A simple aggregate() solution in base R:

# storms ships with dplyr; \(y) is the lambda shorthand available from R 4.1 onward
storms_by_year <- aggregate(year ~ name, data = dplyr::storms, \(y) paste(unique(y), collapse = "|"))

> tail(storms_by_year)
       name           year
209     Two      2010|2014
210   Vicky           2020
211   Vince           2005
212 Wilfred           2020
213   Wilma           2005
214    Zeta 2005|2006|2020

Each year has four digits, so the names used in more than one year are simply those whose year string is longer than four characters:

> tail(storms_by_year[nchar(storms_by_year$year)>4,])
         name                year
187  Philippe      2005|2011|2017
191      Rina           2011|2017
197 Sebastien           1995|2019
204       Ten 2005|2007|2011|2020
209       Two           2010|2014
214      Zeta      2005|2006|2020
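
The string-length filter above relies on every year being exactly four characters. As a hedged alternative, here is a minimal base R sketch that counts the distinct years per name directly. It uses a toy data frame built from the example in the question rather than the real storms data; the names `toy`, `n_years` and `reused` are made up for illustration.

```r
# Toy data taken from the example in the question (not the real storms data).
toy <- data.frame(
  name = c("Alberto", "Alberto", "Gordon", "Felix"),
  year = c(1997, 2001, 1993, 2000)
)

# Number of distinct years per name; tapply() returns a vector named by storm name.
n_years <- tapply(toy$year, toy$name, function(y) length(unique(y)))

# Names that were used in more than one year.
reused <- names(n_years)[n_years > 1]
reused

# All rows belonging to those names.
toy[toy$name %in% reused, ]
```

Replacing `toy` with the storms data should give the same set of names as the nchar() filter, without depending on how the years are formatted.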

CodePudding user response：

dplyr::storms %>%
  count(name, year) %>%  
  arrange(name)

# A tibble: 512 x 3
   name      year     n
   <chr>    <dbl> <int>
 1 AL011993  1993     8
 2 AL012000  2000     4
 3 AL021992  1992     5
 4 AL021994  1994     6
 5 AL021999  1999     4
 6 AL022000  2000    12
 7 AL022001  2001     5
 8 AL022003  2003     4
 9 AL022006  2006     5
10 AL031987  1987    32
# ... with 502 more rows

For example, for Alberto:

dplyr::storms %>%
  count(name, year) %>%  
  arrange(name) %>%  
  filter(name == "Alberto")

# A tibble: 7 x 3
  name     year     n
  <chr>   <dbl> <int>
1 Alberto  1982    17
2 Alberto  1988    11
3 Alberto  1994    32
4 Alberto  2000    79
5 Alberto  2006    18
6 Alberto  2012    13
7 Alberto  2018    14

Or, with distinct(), which returns one row per name/year pair without the count:

dplyr::storms %>%
  distinct(name, year) %>%  
  arrange(name)
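
The snippets in this answer list every name/year pair, including names that were used only once. One possible way to narrow that down to reused names is sketched below with add_count(). It runs on a toy tibble built from the example in the question; `toy`, `reused` and the column name `n_years` are illustrative choices, not part of the original answer.

```r
library(dplyr)

# Toy data from the question's example; swap in `storms` for the real data.
toy <- tibble(
  name = c("Alberto", "Alberto", "Gordon", "Felix"),
  year = c(1997, 2001, 1993, 2000)
)

reused <- toy %>%
  distinct(name, year) %>%               # one row per name/year pair
  add_count(name, name = "n_years") %>%  # number of distinct years for each name
  filter(n_years > 1) %>%                # keep names seen in more than one year
  arrange(name, year)

reused
```

Because add_count() keeps every row and only adds a column, the result stays ungrouped and still shows each year on its own row.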

CodePudding user response：

A simple way would be to count by name and year, then take the result and group by name, filtering out those groups with only a single entry:

storms %>% 
  count(name, year) %>%
  group_by(name) %>%
  filter(n() > 1) %>%
  select(-n)
#> # A tibble: 404 x 2
#> # Groups:   name [106]
#>    name     year
#>    <chr>   <dbl>
#>  1 Alberto  1982
#>  2 Alberto  1988
#>  3 Alberto  1994
#>  4 Alberto  2000
#>  5 Alberto  2006
#>  6 Alberto  2012
#>  7 Alberto  2018
#>  8 Alex     1998
#>  9 Alex     2004
#> 10 Alex     2010
#> # ... with 394 more rows

Created on 2022-11-14 with reprex v2.0.2
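
If one row per reused name is enough, a similar idea can be sketched with n_distinct() inside summarise(). This is only an illustration on a toy tibble taken from the question's example; `toy`, `result` and the summary column names are assumptions made for the sketch.

```r
library(dplyr)

# Toy data from the question's example; swap in `storms` for the real data.
toy <- tibble(
  name = c("Alberto", "Alberto", "Gordon", "Felix"),
  year = c(1997, 2001, 1993, 2000)
)

result <- toy %>%
  group_by(name) %>%
  summarise(
    n_years    = n_distinct(year),  # how many different years used this name
    first_year = min(year),
    last_year  = max(year)
  ) %>%
  filter(n_years > 1)               # keep only names used in more than one year

result
```

With a single grouping variable, summarise() returns an ungrouped tibble, so no `.groups` message is printed here.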

CodePudding user response：

Reduce the data to distinct name/year pairs, keep the names that appear more than once, then arrange:

library(dplyr)

storms %>% 
  distinct(name, year) %>% 
  filter(duplicated(name) | duplicated(name, fromLast = TRUE)) %>% 
  arrange(name)

# # A tibble: 404 × 2
#    name     year
#    <chr>   <dbl>
#  1 Alberto  1982
#  2 Alberto  1988
#  3 Alberto  1994
#  4 Alberto  2000
#  5 Alberto  2006
#  6 Alberto  2012
#  7 Alberto  2018
#  8 Alex     1998
#  9 Alex     2004
# 10 Alex     2010
# # … with 394 more rows
# # ℹ Use `print(n = ...)` to see more rows

The same result in @Ottie's summarized format, with one row per name and its years collapsed into a single string:

library(dplyr)

storms %>% 
  distinct(name, year) %>% 
  group_by(name) %>% 
  filter(n() > 1) %>% 
  summarize(year = paste(year, collapse = "|"))

# # A tibble: 106 × 2
#    name    year 
#    <chr>   <chr>                             
#  1 Alberto 1982|1988|1994|2000|2006|2012|2018
#  2 Alex    1998|2004|2010|2016               
#  3 Allison 1989|1995|2001                    
#  4 Ana     1979|1985|1991|1997|2003|2009|2015
#  5 Andrew  1986|1992                         
#  6 Arthur  1984|1990|1996|2002|2008|2014|2020
#  7 Barry   1983|1989|1995|2001|2007|2013|2019
#  8 Beryl   1982|1988|1994|2000|2006|2012|2018
#  9 Beta    2005|2020                         
# 10 Bill    1997|2003|2009|2015               
# # … with 96 more rows
# # ℹ Use `print(n = ...)` to see more rows