I am working with dplyr and its 'storms' dataset.
I need a table in which I have each measured storm in a column. Then I want to give each row an ID.
So far I have
storm_ID <- storms %>%
select(year,month,name) %>%
group_by(year,month,name) %>%
summarise(ID = n())
storm_ID
View(storm_ID)
However, this does not give me what I need.
I don't quite understand how I can see every single storm in the table. I had previously sorted them by name. Then I get 214 storms. However, storms with the same name occur in several years.
At the end I want something like:
name | year | month | day | ID
...  | ...  | ...   | ... | ...
Zeta | 2005 | 12    | 31  | Zeta1
Zeta | 2006 | 1     | 1   | Zeta1
...  | ...  | ...   | ... | ...
Zeta | 2020 | 10    | 24  | Zeta2
To do this, I need to know whether a storm spans two years (e.g. from 2005-12-31 to 2006-01-01); such a storm should be counted only once.
After that, I should be able to evaluate the duration, wind speed difference and pressure difference per storm, which I had already done, but with the wrong sorting.
Help would be nice.
Thanks in advance.
CodePudding user response:
For your first problem:
storm_ID <- storms %>%
select(year,month,name) %>%
group_by(year,month,name) %>%
mutate(ID = stringr::str_c(name, cur_group_id()))
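This creates a storm ID by appending the index of the current (year, month, name) group to the storm name. Note that a storm spanning two months falls into two groups and therefore gets two IDs.
This is how you can check whether a storm has happened in consecutive years: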
storms %>%
group_by(name) %>%
mutate(consec_helper = cumsum(c(1, diff(year) != 1))) %>%
group_by(name, consec_helper) %>%
filter(n() > 1)
I find this to be true only for Zeta:
name year
<chr> <dbl>
1 Zeta 2005
2 Zeta 2006
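To make the consec_helper idiom above easier to follow, here is a minimal sketch on two small vectors of years (made-up values, not taken from storms). It suggests why the filter keeps only the two rows on either side of a year change: in row-level data a repeated year gives a difference of 0, which also starts a new run.

```r
# Distinct years: a run continues only while the year increases by exactly 1.
years <- c(2005, 2006, 2020, 2021, 2022)
run_distinct <- cumsum(c(1, diff(years) != 1))
run_distinct
#> [1] 1 1 2 2 2

# Repeated years, as in row-level data: a difference of 0 also starts a new run,
# so only the rows straddling the change 2005 -> 2006 end up sharing a run.
years_rows <- c(2005, 2005, 2005, 2006, 2006)
run_rows <- cumsum(c(1, diff(years_rows) != 1))
run_rows
#> [1] 1 2 3 3 4
```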
CodePudding user response:
Thanks for your approaches; unfortunately, none of them is quite the solution I need.
I asked my professor for help, and he said I could solve this with a loop. (I didn't expect an answer.) So I go through the names in order and check where they change. The data set is sorted by date, so two Zeta records are not adjacent unless they belong to the same storm.
My current solution is :
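install.packages("dplyr")
library(dplyr)
id <- c()
j <- 1  # current storm id
k <- 1  # index of the current row
for(i in storms$name) {
  # the first row always belongs to storm 1
  if(k-1 == 0){
    id <- append(id, j)
    k <- k + 1
    next
  }
  # a name that differs from the previous row starts a new storm
  if(i != storms$name[k-1])
  {
    j <- j + 1
  }
  id <- append(id, j)
  k <- k + 1
}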
storms <- cbind(storms, id)
View(storms)
I have now manually checked the dataset and think it is the appropriate solution to my problem.
This brings me to 511 different storms. (As of 22-04-15)
Nevertheless, thank you for all the solutions, I appreciate it very much.
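For comparison, the same "new id whenever the name changes" rule can be written without a loop. The following is only a sketch on a made-up toy table (the column names date, wind and pressure and all values are assumptions for illustration, not the real storms data). It also adds the Zeta1/Zeta2 style label and a per-storm summary of the kind asked for in the question. Like the loop, it assumes that two different storms with the same name never sit in adjacent rows.

```r
library(dplyr)

# Toy stand-in for `storms` (hypothetical values), already sorted by date.
# "Zeta" occurs in two separate runs, so it should yield two storms.
toy <- tibble(
  name     = c("Amy", "Amy", "Zeta", "Zeta", "Bob", "Zeta", "Zeta"),
  date     = as.Date(c("2005-06-01", "2005-06-02", "2005-12-31", "2006-01-01",
                       "2010-08-10", "2020-10-24", "2020-10-26")),
  wind     = c(30, 45, 50, 40, 35, 60, 90),
  pressure = c(1005, 998, 994, 1000, 1006, 990, 970)
)

toy_id <- toy %>%
  # A new storm starts wherever the name differs from the previous row.
  mutate(id = cumsum(name != lag(name, default = first(name))) + 1L) %>%
  # Number the runs within each name to get labels such as Zeta1, Zeta2.
  group_by(name) %>%
  mutate(label = paste0(name, dense_rank(id))) %>%
  ungroup()

# Per-storm summary: duration in days and the range of wind and pressure.
per_storm <- toy_id %>%
  group_by(id, label) %>%
  summarise(
    duration_days = as.numeric(max(date) - min(date)),
    wind_diff     = max(wind) - min(wind),
    pressure_diff = max(pressure) - min(pressure),
    .groups = "drop"
  )
```

With dplyr 1.1.0 or later, consecutive_id(name) should give the same run index as the cumsum() line.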
CodePudding user response:
If consecutive records with the same name count as one storm as long as there are no long gaps (days without a record of that name), then the following code might be what you want.
The variable Thresh is the maximum gap, in days, between consecutive records of the same name for them to still be counted as the same storm.
suppressPackageStartupMessages(library(dplyr))
data("storms", package = "dplyr")
Thresh <- 5
storms %>%
count(name, year, month, day) %>%
group_by(name) %>%
mutate(Date = as.Date(ISOdate(year, month, day)),
DDiff = c(0, diff(Date)) > Thresh,
DDiff = cumsum(DDiff)) %>%
group_by(name, DDiff) %>%
mutate(name = ifelse(DDiff > 0, paste(name, cur_group_id(), sep = "."), name)) %>%
ungroup() %>%
group_by(name) %>%
summarise(name = first(name),
year = first(year),
n = sum(n))
#> # A tibble: 512 x 3
#> name year n
#> <chr> <dbl> <int>
#> 1 AL011993 1993 8
#> 2 AL012000 2000 4
#> 3 AL021992 1992 5
#> 4 AL021994 1994 6
#> 5 AL021999 1999 4
#> 6 AL022000 2000 12
#> 7 AL022001 2001 5
#> 8 AL022003 2003 4
#> 9 AL022006 2006 5
#> 10 AL031987 1987 32
#> # ... with 502 more rows
Created on 2022-04-15 by the reprex package (v2.0.1)
Edit
After seeing the OP's answer, I have revised mine and they are now nearly identical.
The main difference is that, even with the allowed gap between days with records, Thresh, set to 5, storm Dorian is split in two: it has 5 consecutive days without records, between July 27th 2013 and August 2nd 2013. Even so, it should be considered one storm only. To get this result, increase Thresh to an appropriate value, for instance 30 (days), and the outputs then match.
I have left it like this to show this point and to show what the variable Thresh is meant for.
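To see how Thresh changes the grouping, the gap rule can be tried on Dorian's record dates alone. This is a small sketch: split_id is a hypothetical helper name, not part of the code in this answer, and the dates are the seven days on which Dorian has records in 2013.

```r
# Days with Dorian records in 2013; note the jump from 07-27 to 08-02.
dates <- as.Date(c("2013-07-23", "2013-07-24", "2013-07-25", "2013-07-26",
                   "2013-07-27", "2013-08-02", "2013-08-03"))

# Hypothetical helper: start a new group whenever the gap to the
# previous date is larger than `thresh` days.
split_id <- function(dates, thresh) {
  gaps <- c(0, as.numeric(diff(dates)))  # 0 1 1 1 1 6 1
  cumsum(gaps > thresh)
}

split_id(dates, 5)   # the 6-day jump exceeds 5, so Dorian is split in two
#> [1] 0 0 0 0 0 1 1
split_id(dates, 30)  # no gap exceeds 30, so Dorian stays one storm
#> [1] 0 0 0 0 0 0 0
```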
In the code that follows, I assign the result of my code above to the data.frame rui. The OP's result is cbind'ed with id, piped to a count instruction, and saved in storm_count. The two outputs are compared for differences with anti_join, after removing the id from my name column.
suppressPackageStartupMessages(library(dplyr))
data("storms", package = "dplyr")
Thresh <- 5
storms %>%
count(name, year, month, day) %>%
group_by(name) %>%
mutate(Date = as.Date(ISOdate(year, month, day)),
DDiff = c(0, diff(Date)) > Thresh,
DDiff = cumsum(DDiff)) %>%
group_by(name, DDiff) %>%
mutate(name = ifelse(DDiff > 0, paste(name, cur_group_id(), sep = "."), name)) %>%
ungroup() %>%
group_by(name) %>%
summarise(name = first(name),
year = first(year),
n = sum(n)) -> rui
id <- c()
j <- 1
k <- 1
for(i in storms$name) {
  if(k-1 == 0){
    id <- append(id, j)
    k <- k + 1
    next
  }
  if(i != storms$name[k-1])
  {
    j <- j + 1
  }
  id <- append(id, j)
  k <- k + 1
}
cbind(storms, id) %>%
count(name, id) -> storm_count
# two rows
anti_join(
rui %>% mutate(name = sub("\\.\\d+$", "", name)),
storm_count,
by = c("name", "n")
)
#> # A tibble: 2 x 3
#> name year n
#> <chr> <dbl> <int>
#> 1 Dorian 2013 16
#> 2 Dorian 2013 4
# just one row
anti_join(
storm_count,
rui %>% mutate(name = sub("\\.\\d+$", "", name)),
by = c("name", "n")
)
#> name id n
#> 1 Dorian 397 20
# see here the dates of 2013-07-27 and 2013-08-02
storms %>%
filter(name == "Dorian", year == 2013) %>%
count(name, year, month, day)
#> # A tibble: 7 x 5
#> name year month day n
#> <chr> <dbl> <dbl> <int> <int>
#> 1 Dorian 2013 7 23 1
#> 2 Dorian 2013 7 24 4
#> 3 Dorian 2013 7 25 4
#> 4 Dorian 2013 7 26 4
#> 5 Dorian 2013 7 27 3
#> 6 Dorian 2013 8 2 1
#> 7 Dorian 2013 8 3 3
Created on 2022-04-15 by the reprex package (v2.0.1)