I have two columns and would like to find the values where any value in one column is matched by any other row's same column or other column value.
This is my data:
my_data <- data.frame(
storm_name = c("DEAN_2007","FELIX_2007","GUSTAV_2008","IKE_2008","PALOMA_2008","EARL_2010","IGOR_2010","KATIA_2011","JOAQUIN_2015","MATTHEW_2016","HARVEY_2017","IRMA_2017","MARIA_2017","FLORENCE_2018","MICHAEL_2018","DORIAN_2019","LORENZO_2019"),
year_month_min = c("2007-08","2007-08","2008-08","2008-09","2008-11","2010-08","2010-09","2011-08","2015-09","2016-09","2017-08","2017-08","2017-09","2018-08","2018-10","2019-08","2019-09"),
year_month_max = c("2007-08","2007-09","2008-09","2008-09","2008-11","2010-09","2010-09","2011-09","2015-10","2016-10","2017-09","2017-09","2017-10","2018-09","2018-10","2019-09","2019-10"))
In the context of my data, this means I want to find the year_month_min and year_month_max values where more than one storm took place. In other words, find the year_month values where
- a storm's year_month_min value is matched by another storm's year_month_min value
- a storm's year_month_max value is matched by another storm's year_month_max value
- a storm's year_month_min value is matched by another storm's year_month_max value
The first two of these could easily be achieved with something like
my_data %>% dplyr::add_count(year_month_min) %>% dplyr::filter(n>1)
and
my_data %>% dplyr::add_count(year_month_max) %>% dplyr::filter(n>1)
However, I struggle with the third one as I don't know a way to compare values between columns like this.
Any help would be greatly appreciated!
CodePudding user response:
Something like this:
library(dplyr)
library(tidyr)
my_data %>%
pivot_longer(-storm_name) %>%
group_by(name, value) %>%
add_count() %>%
filter(n()>1)
storm_name name value n
<chr> <chr> <chr> <int>
1 DEAN_2007 year_month_min 2007-08 2
2 FELIX_2007 year_month_min 2007-08 2
3 GUSTAV_2008 year_month_max 2008-09 2
4 IKE_2008 year_month_max 2008-09 2
5 EARL_2010 year_month_max 2010-09 2
6 IGOR_2010 year_month_max 2010-09 2
7 HARVEY_2017 year_month_min 2017-08 2
8 HARVEY_2017 year_month_max 2017-09 2
9 IRMA_2017 year_month_min 2017-08 2
10 IRMA_2017 year_month_max 2017-09 2
CodePudding user response:
library(tidyverse)
my_data %>%
pivot_longer(-storm_name) %>%
add_count(value, name = "count") %>%
pivot_wider(names_from = name, values_from = c(value, count))
# A tibble: 17 × 5
storm_name value_year_month_min value_year_month_max count_year_month_min count_year_month_max
<chr> <chr> <chr> <int> <int>
1 DEAN_2007 2007-08 2007-08 3 3
2 FELIX_2007 2007-08 2007-09 3 1
3 GUSTAV_2008 2008-08 2008-09 1 3
4 IKE_2008 2008-09 2008-09 3 3
5 PALOMA_2008 2008-11 2008-11 2 2
6 EARL_2010 2010-08 2010-09 1 3
7 IGOR_2010 2010-09 2010-09 3 3
8 KATIA_2011 2011-08 2011-09 1 1
9 JOAQUIN_2015 2015-09 2015-10 1 1
10 MATTHEW_2016 2016-09 2016-10 1 1
11 HARVEY_2017 2017-08 2017-09 2 3
12 IRMA_2017 2017-08 2017-09 2 3
13 MARIA_2017 2017-09 2017-10 3 1
14 FLORENCE_2018 2018-08 2018-09 1 1
15 MICHAEL_2018 2018-10 2018-10 2 2
16 DORIAN_2019 2019-08 2019-09 1 2
17 LORENZO_2019 2019-09 2019-10 2 1
CodePudding user response:
You can separate year_month_min
and year_month_max
into different dataframes and match them with inner_join()
. The suffix also implies which year_month
of a storm is matched.
library(dplyr)
inner_join(
select(my_data, year_month = year_month_min, storm_name),
select(my_data, year_month = year_month_max, storm_name),
by = "year_month", suffix = c(".min", ".max")
) %>%
filter(storm_name.min != storm_name.max) %>%
add_count(year_month)
# year_month storm_name.min storm_name.max n
# 1 2007-08 FELIX_2007 DEAN_2007 1
# 2 2008-09 IKE_2008 GUSTAV_2008 1
# 3 2010-09 IGOR_2010 EARL_2010 1
# 4 2017-09 MARIA_2017 HARVEY_2017 2
# 5 2017-09 MARIA_2017 IRMA_2017 2
# 6 2019-09 LORENZO_2019 DORIAN_2019 1