R - dplyr- Reducing data package 'storms'

Time:04-15

I am working with dplyr and the data package 'storms'.

I need a table in which I have each measured storm in a column. Then I want to give each row an ID.

So far I have

storm_ID <- storms %>%
  select(year,month,name) %>% 
  group_by(year,month,name) %>% 
  summarise(ID = n())
storm_ID

View(storm_ID)

The only problem is that this doesn't give me what I need.

I don't quite understand how I can see every single storm in the table. I had previously sorted them by name. Then I get 214 storms. However, storms with the same name occur in several years.

At the end I want something like:

name   |   year   |   month   |   day   |    ID 

  |          |          |          |          |
Zeta       2005        12          31       Zeta1 
Zeta       2006        1           1        Zeta1
  |          |          |          |          |
Zeta       2020        10          24       Zeta2

To do this, I need to know if a storm spanned two years (e.g. from 2005-12-31 to 2006-01-01). Such a storm should then be counted as only one storm.

After that I should be able to evaluate the duration, wind speed difference and pressure difference per storm, which I had already evaluated with the wrong sorting.

Help would be nice.

Thanks in advance.

CodePudding user response:

For your first problem:

storm_ID <- storms %>%
  select(year,month,name) %>% 
  group_by(year,month,name) %>%
  mutate(ID = stringr::str_c(name, cur_group_id()))

This creates a unique storm-name ID, e.g. Amy1, Amy2, etc.

This is how you can check if a storm has happened in consecutive years

storms %>%
  group_by(name) %>%
  mutate(consec_helper = cumsum(c(1, diff(year) != 1))) %>%
  group_by(name, consec_helper) %>%
  filter(n() > 1)

I find this to be true only for Zeta

 name   year
  <chr> <dbl>
1 Zeta   2005
2 Zeta   2006
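
To see why the filter keeps only the rows around a year boundary, here is a minimal sketch of the helper on a toy vector. The years are hypothetical per-row values for a single name, not taken from storms. Because diff(year) is 0 between rows of the same year, almost every row starts its own run; only a step of exactly one year keeps two rows in the same run, and those are the rows that survive n() > 1.

```r
# Hypothetical per-row years for one storm name (several records per year)
year <- c(2005, 2005, 2006, 2006, 2020)

# diff(year) != 1 is TRUE wherever the step to the next row is not exactly
# one year, and cumsum() starts a new run number at each TRUE.
consec_helper <- cumsum(c(1, diff(year) != 1))
consec_helper
#> [1] 1 2 2 3 4

# Only run 2 has more than one row: the 2005 -> 2006 boundary.
table(consec_helper)
```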

CodePudding user response:

Thanks for your approaches; unfortunately, none of them is the appropriate solution.

I asked my professor for help, and he said I could solve it with a loop. (I didn't expect an answer.) So I go through the names in order and check whether they change. The data set is sorted by date, so Zeta does not occur in consecutive rows unless it is the same storm.

My current solution is :

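install.packages("dplyr")
library(dplyr)

id <- c()
j <- 1  # current storm ID
k <- 1  # index of the current row
for(i in storms$name) {
  # the first row always gets ID 1
  if(k-1 == 0){
    id <- append(id, j)
    k <- k + 1
    next
  }
  # a name that differs from the previous row starts a new storm
  if(i != storms$name[k-1])
  {
    j <- j + 1
  }
  id <- append(id, j)
  k <- k + 1
}
storms <- cbind(storms, id)
View(storms)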

I have now manually checked the dataset and think it is the appropriate solution to my problem.

This brings me to 511 different storms. (As of 22-04-15)

Nevertheless, thank you for all the solutions, I appreciate it very much.
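
For reference, the same run-based ID can probably be computed without a loop, and the per-storm evaluation mentioned in the question then becomes a grouped summary. The following is only a sketch on toy data: the values are hypothetical, the column names follow storms, and the names toy, per_storm, duration_days, wind_diff and pressure_diff are made up for illustration. Recent dplyr versions (1.1.0 and later) also provide consecutive_id(), which should give the same IDs.

```r
library(dplyr)

# Toy data in record order (hypothetical values; column names follow storms)
toy <- tibble(
  name     = c("Zeta", "Zeta", "Alberto", "Alberto", "Zeta"),
  year     = c(2005, 2006, 2006, 2006, 2020),
  month    = c(12, 1, 6, 6, 10),
  day      = c(31, 1, 10, 12, 24),
  wind     = c(45, 55, 30, 60, 40),
  pressure = c(1000, 994, 1005, 995, 1002)
)

per_storm <- toy %>%
  mutate(
    # A new id starts whenever the name differs from the previous row,
    # which mirrors what the loop does with j.
    id   = cumsum(name != lag(name, default = first(name))) + 1,
    date = as.Date(ISOdate(year, month, day))
  ) %>%
  group_by(id, name) %>%
  summarise(
    duration_days = as.numeric(difftime(max(date), min(date), units = "days")),
    wind_diff     = max(wind) - min(wind),
    pressure_diff = max(pressure) - min(pressure),
    .groups = "drop"
  )
per_storm
```

In this toy example the Zeta records from 2005-12-31 and 2006-01-01 share id 1, while the 2020 Zeta record gets id 3, so the year change alone does not split a storm.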

CodePudding user response:

If a storm that continues into the next day counts as one storm, provided there are no gaps (days without a record of a storm of the same name), then the following code might be what you want.
The variable Thresh is the maximum number of days allowed between consecutive records for them to be counted as the same storm.

suppressPackageStartupMessages(library(dplyr))

data("storms", package = "dplyr")

Thresh <- 5

storms %>%
  count(name, year, month, day) %>%
  group_by(name) %>%
  mutate(Date = as.Date(ISOdate(year, month, day)),
         DDiff = c(0, diff(Date)) > Thresh,
         DDiff = cumsum(DDiff)) %>%
  group_by(name, DDiff) %>%
  mutate(name = ifelse(DDiff > 0, paste(name, cur_group_id(), sep = "."), name)) %>%
  ungroup() %>%
  group_by(name) %>%
  summarise(name = first(name),
            year = first(year),
            n = sum(n))
#> # A tibble: 512 x 3
#>    name      year     n
#>    <chr>    <dbl> <int>
#>  1 AL011993  1993     8
#>  2 AL012000  2000     4
#>  3 AL021992  1992     5
#>  4 AL021994  1994     6
#>  5 AL021999  1999     4
#>  6 AL022000  2000    12
#>  7 AL022001  2001     5
#>  8 AL022003  2003     4
#>  9 AL022006  2006     5
#> 10 AL031987  1987    32
#> # ... with 502 more rows

Created on 2022-04-15 by the reprex package (v2.0.1)


Edit

After seeing the OP's answer, I have revised mine and they are now nearly identical.

The main difference is that, even with the allowed gap between days with records, Thresh, set to 5, storm Dorian has 5 consecutive days without records, between July 27th 2013 and August 2nd 2013, so it is split in two. It should nevertheless be considered one storm only. To get this result, increase Thresh to an appropriate value, for instance 30 (days); the outputs then match.

I have left it like this to show this point and to show what the variable Thresh is meant for.
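
The effect of Thresh can be seen in isolation with a small sketch that uses only Dorian's 2013 record dates, copied from the count() output at the end of this answer. The function name gap_group is hypothetical; it just wraps the cumsum(c(0, diff(Date)) > Thresh) step from the pipeline.

```r
# Dorian's 2013 record dates, as listed in the count() output of this answer
dates <- as.Date(c("2013-07-23", "2013-07-24", "2013-07-25", "2013-07-26",
                   "2013-07-27", "2013-08-02", "2013-08-03"))

# Hypothetical helper: a new group starts whenever the gap to the
# previous date is larger than thresh days.
gap_group <- function(dates, thresh) {
  cumsum(c(0, diff(dates)) > thresh)
}

# The step from 07-27 to 08-02 is 6 days, which exceeds 5: two groups.
gap_group(dates, 5)
#> [1] 0 0 0 0 0 1 1

# With a threshold of 30 days everything stays in one group.
gap_group(dates, 30)
#> [1] 0 0 0 0 0 0 0
```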

In the code that follows, I assign the result of my code above to the data.frame rui. The OP's result is cbind'ed with id, piped to a count instruction and saved in storm_count. The two outputs are then compared for differences with anti_join after removing the id suffix from my name column.

suppressPackageStartupMessages(library(dplyr))

data("storms", package = "dplyr")

Thresh <- 5

storms %>%
  count(name, year, month, day) %>%
  group_by(name) %>%
  mutate(Date = as.Date(ISOdate(year, month, day)),
         DDiff = c(0, diff(Date)) > Thresh,
         DDiff = cumsum(DDiff)) %>%
  group_by(name, DDiff) %>%
  mutate(name = ifelse(DDiff > 0, paste(name, cur_group_id(), sep = "."), name)) %>%
  ungroup() %>%
  group_by(name) %>%
  summarise(name = first(name),
            year = first(year),
            n = sum(n)) -> rui


id <- c()
j <- 1
k <- 1
for(i in storms$name) {
  if(k-1 == 0){
    id <- append(id, j)
    k <- k + 1
    next
  }
  if(i != storms$name[k-1])
  {
    j <- j + 1
  }
  id <- append(id, j)
  k <- k + 1
}

cbind(storms, id) %>% 
  count(name, id) -> storm_count

# two rows
anti_join(
  rui %>% mutate(name = sub("\\.\\d+$", "", name)), 
  storm_count,
  by = c("name", "n")
)
#> # A tibble: 2 x 3
#>   name    year     n
#>   <chr>  <dbl> <int>
#> 1 Dorian  2013    16
#> 2 Dorian  2013     4

# just one row
anti_join(
  storm_count,
  rui %>% mutate(name = sub("\\.\\d+$", "", name)),
  by = c("name", "n")
)
#>     name  id  n
#> 1 Dorian 397 20

# see here the dates of 2013-07-27 and 2013-08-02
storms %>%
  filter(name == "Dorian", year == 2013) %>%
  count(name, year, month, day)
#> # A tibble: 7 x 5
#>   name    year month   day     n
#>   <chr>  <dbl> <dbl> <int> <int>
#> 1 Dorian  2013     7    23     1
#> 2 Dorian  2013     7    24     4
#> 3 Dorian  2013     7    25     4
#> 4 Dorian  2013     7    26     4
#> 5 Dorian  2013     7    27     3
#> 6 Dorian  2013     8     2     1
#> 7 Dorian  2013     8     3     3

Created on 2022-04-15 by the reprex package (v2.0.1)
