Home > Software engineering >  Find values that appear more than once in the same column or in another column
Find values that appear more than once in the same column or in another column

Time:01-16

I have two columns and would like to find the values where any value in one column is matched by any other row's same column or other column value.

This is my data:

my_data <- data.frame(
  storm_name = c("DEAN_2007","FELIX_2007","GUSTAV_2008","IKE_2008","PALOMA_2008","EARL_2010","IGOR_2010","KATIA_2011","JOAQUIN_2015","MATTHEW_2016","HARVEY_2017","IRMA_2017","MARIA_2017","FLORENCE_2018","MICHAEL_2018","DORIAN_2019","LORENZO_2019"),
  year_month_min = c("2007-08","2007-08","2008-08","2008-09","2008-11","2010-08","2010-09","2011-08","2015-09","2016-09","2017-08","2017-08","2017-09","2018-08","2018-10","2019-08","2019-09"),
  year_month_max = c("2007-08","2007-09","2008-09","2008-09","2008-11","2010-09","2010-09","2011-09","2015-10","2016-10","2017-09","2017-09","2017-10","2018-09","2018-10","2019-09","2019-10"))

In the context of my data, this means I want to find the year_month_min and year_month_max values where more than one storm took place. In other words, find the year_month values where

  • a storm's year_month_min value is matched by another storm's year_month_min value
  • a storm's year_month_max value is matched by another storm's year_month_max value
  • a storm's year_month_min value is matched by another storm's year_month_max value

The first two of these could easily be achieved with something like

my_data %>% dplyr::add_count(year_month_min) %>% dplyr::filter(n>1)

and

my_data %>% dplyr::add_count(year_month_max) %>% dplyr::filter(n>1)

However, I struggle with the third one as I don't know a way to compare values between columns like this.

Any help would be greatly appreciated!

CodePudding user response:

Something like this:

library(dplyr)
library(tidyr)

my_data %>% 
  pivot_longer(-storm_name) %>% 
  group_by(name, value) %>% 
  add_count() %>% 
  filter(n()>1) 
   storm_name  name           value       n
   <chr>       <chr>          <chr>   <int>
 1 DEAN_2007   year_month_min 2007-08     2
 2 FELIX_2007  year_month_min 2007-08     2
 3 GUSTAV_2008 year_month_max 2008-09     2
 4 IKE_2008    year_month_max 2008-09     2
 5 EARL_2010   year_month_max 2010-09     2
 6 IGOR_2010   year_month_max 2010-09     2
 7 HARVEY_2017 year_month_min 2017-08     2
 8 HARVEY_2017 year_month_max 2017-09     2
 9 IRMA_2017   year_month_min 2017-08     2
10 IRMA_2017   year_month_max 2017-09     2

CodePudding user response:

library(tidyverse)

my_data %>%  
  pivot_longer(-storm_name) %>%  
  add_count(value, name = "count") %>%  
  pivot_wider(names_from = name, values_from = c(value, count)) 

# A tibble: 17 × 5
   storm_name    value_year_month_min value_year_month_max count_year_month_min count_year_month_max
   <chr>         <chr>                <chr>                               <int>                <int>
 1 DEAN_2007     2007-08              2007-08                                 3                    3
 2 FELIX_2007    2007-08              2007-09                                 3                    1
 3 GUSTAV_2008   2008-08              2008-09                                 1                    3
 4 IKE_2008      2008-09              2008-09                                 3                    3
 5 PALOMA_2008   2008-11              2008-11                                 2                    2
 6 EARL_2010     2010-08              2010-09                                 1                    3
 7 IGOR_2010     2010-09              2010-09                                 3                    3
 8 KATIA_2011    2011-08              2011-09                                 1                    1
 9 JOAQUIN_2015  2015-09              2015-10                                 1                    1
10 MATTHEW_2016  2016-09              2016-10                                 1                    1
11 HARVEY_2017   2017-08              2017-09                                 2                    3
12 IRMA_2017     2017-08              2017-09                                 2                    3
13 MARIA_2017    2017-09              2017-10                                 3                    1
14 FLORENCE_2018 2018-08              2018-09                                 1                    1
15 MICHAEL_2018  2018-10              2018-10                                 2                    2
16 DORIAN_2019   2019-08              2019-09                                 1                    2
17 LORENZO_2019  2019-09              2019-10                                 2                    1

CodePudding user response:

You can separate year_month_min and year_month_max into different dataframes and match them with inner_join(). The suffix also implies which year_month of a storm is matched.

library(dplyr)

inner_join(
  select(my_data, year_month = year_month_min, storm_name),
  select(my_data, year_month = year_month_max, storm_name),
  by = "year_month", suffix = c(".min", ".max")
) %>%
  filter(storm_name.min != storm_name.max) %>%
  add_count(year_month)

#   year_month storm_name.min storm_name.max n
# 1    2007-08     FELIX_2007      DEAN_2007 1
# 2    2008-09       IKE_2008    GUSTAV_2008 1
# 3    2010-09      IGOR_2010      EARL_2010 1
# 4    2017-09     MARIA_2017    HARVEY_2017 2
# 5    2017-09     MARIA_2017      IRMA_2017 2
# 6    2019-09   LORENZO_2019    DORIAN_2019 1
  • Related