Calculate the median date based on two grouping conditions

Time:04-06

I have the following data frame:

> head(df)
# A tibble: 6 x 6
# Groups:   lat, decade [2]
    lat  long date       year  decade    month_day
  <dbl> <dbl> <date>     <chr> <chr>     <chr>    
1    55    18 1952-02-03 1952  1950-1959 02-03    
2    55    18 1958-02-08 1958  1950-1959 02-08    
3    55    18 1958-02-08 1958  1950-1959 02-08    
4    55    18 1958-02-08 1958  1950-1959 02-08    
5    55    18 1965-02-07 1965  1960-1969 02-07    
6    55    18 1966-03-03 1966  1960-1969 03-03    
   


> summary(df)
          lat             long         date                year              decade         
     Min.   :55.00   Min.   :18   Min.   :1951-03-22   Length:1414        Length:1414       
     1st Qu.:56.00   1st Qu.:18   1st Qu.:1987-01-01   Class :character   Class :character  
     Median :58.00   Median :18   Median :2004-04-02   Mode  :character   Mode  :character  
     Mean   :59.07   Mean   :18   Mean   :1999-02-16                                        
     3rd Qu.:62.00   3rd Qu.:18   3rd Qu.:2014-01-01                                        
     Max.   :68.00   Max.   :18   Max.   :2021-03-28                                        
      month_day        
     Length:1414       
     Class :character  
     Mode  :character 

I would like to get the median month_day per degree of latitude (lat) and per decade.

I have tried this but cannot get past an error:

df = df %>%
  group_by(lat, decade) %>%
  summarise(across(month_day, median)) %>%
  ungroup

Error in `summarise()`:
! Problem while computing `..1 = across(month_day, median)`.
Caused by error:
! `month_day` must return compatible vectors across groups.
i Result type for group 1 (lat = 55, decade = "1950-1959"): <double>.
i Result type for group 2 (lat = 55, decade = "1960-1969"): <character>.

I do not know how to solve it, thank you very much for your help.

EDIT:

> ds_filtered_median[ds_filtered_median$lat == '57', ]
# A tibble: 124 x 6
     lat  long date       year  decade    month_day
   <dbl> <dbl> <date>     <chr> <chr>     <chr>    
 1    57    18 1955-04-08 1955  1950-1959 04-08    
 2    57    18 1957-02-19 1957  1950-1959 02-19    
 3    57    18 1958-04-06 1958  1950-1959 04-06    
 4    57    18 1959-01-01 1959  1950-1959 01-01    
 5    57    18 1960-01-03 1960  1960-1969 01-03    
 6    57    18 1961-01-02 1961  1960-1969 01-02    
 7    57    18 1962-01-02 1962  1960-1969 01-02    
 8    57    18 1963-01-01 1963  1960-1969 01-01    
 9    57    18 1964-01-19 1964  1960-1969 01-19    
10    57    18 1965-01-12 1965  1960-1969 01-12    
# ... with 114 more rows

CodePudding user response:

What you can do is convert each date to the number of days since the start of its year. From that number you can easily calculate the median. Then convert the days back to a date, using any first of January as the reference. You can be off by one in leap years, though. For date manipulation I used lubridate.

library(dplyr)      # %>%, mutate(), group_by(), summarise()
library(lubridate)

data %>%
  mutate(
    date = ymd(date),
    days_since_january = as.numeric(date - ymd(paste(year(date), 1, 1, sep = "-")))
  ) %>%
  group_by(lat, decade) %>%
  summarise(across(days_since_january, median), .groups = "keep") %>%
  mutate(median_month_date = format(ymd("1960-01-01") + days(floor(days_since_january)), "%m-%d"))


# A tibble: 2 x 4
# Groups:   lat, decade [2]
    lat decade    days_since_january median_month_date
  <dbl> <chr>                  <dbl> <chr>            
1    55 1950-1959                 38 02-08            
2    55 1960-1969                 49 02-19  

# A tibble: 2 x 4
# Groups:   lat, decade [2]
    lat decade    days_since_january median_month_date
  <int> <chr>                  <dbl> <chr>            
1    57 1950-1959               72   03-13            
2    57 1960-1969                1.5 01-02   
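
The leap-year caveat can be sidestepped by placing every month_day in one fixed reference year before taking the median. The following is only a sketch of that idea, not part of the answer above: the reference year 2000 (a leap year, so "02-29" also parses) and the column names ref_date, median_day and median_month_day are my own assumptions, and the sample rows are copied from the question.

```r
library(dplyr)

# Sample rows copied from the question (only the columns needed here)
df <- tibble(
  lat       = c(55, 55, 55, 55, 55, 55),
  decade    = c(rep("1950-1959", 4), rep("1960-1969", 2)),
  month_day = c("02-03", "02-08", "02-08", "02-08", "02-07", "03-03")
)

result <- df %>%
  # Assumption: 2000 is used as a common reference year for all observations
  mutate(ref_date = as.Date(paste0("2000-", month_day))) %>%
  group_by(lat, decade) %>%
  # Median of the day counts; floor() handles the half days that
  # appear in groups with an even number of rows
  summarise(median_day = floor(median(as.numeric(ref_date))), .groups = "drop") %>%
  # Convert the day count back to a month-day label
  mutate(median_month_day = format(as.Date(median_day, origin = "1970-01-01"), "%m-%d"))

result
```

Because the median is computed and converted back within the same year, a given month_day always maps to the same day count, whatever year it was observed in.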

CodePudding user response:

You must convert month_day to numeric to get the median. across() is only needed when the same calculation is applied to several columns individually, e.g. to get the median long and lat using data %>% summarise(across(any_of(c("lat", "long")), median))

library(tidyverse)

data <- tribble(
  ~lat, ~long, ~date, ~year, ~decade, ~month_day,
  55, 18, "1952-02-03", 1952, "1950-1959", "02-03",
  55, 18, "1958-02-08", 1958, "1950-1959", "02-08",
  55, 18, "1958-02-08", 1958, "1950-1959", "02-08",
  55, 18, "1958-02-08", 1958, "1950-1959", "02-08",
  55, 18, "1965-02-07", 1965, "1960-1969", "02-07",
  55, 18, "1966-03-03", 1966, "1960-1969", "03-03"
)

data %>%
  mutate(
    # "[0-9]+$" matches the trailing digits, i.e. the day of the month only
    month_day_num = month_day %>% str_extract("[0-9]+$") %>% as.numeric()
  ) %>%
  group_by(lat, decade) %>%
  summarise(
    median_month_day = median(month_day_num)
  )
#> `summarise()` has grouped output by 'lat'. You can override using the `.groups`
#> argument.
#> # A tibble: 2 × 3
#> # Groups:   lat [1]
#>     lat decade    median_month_day
#>   <dbl> <chr>                <dbl>
#> 1    55 1950-1959                8
#> 2    55 1960-1969                5

Created on 2022-04-05 by the reprex package (v2.0.0)
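
As a side note on the error message in the question: as far as I can tell, it comes from how base R's median() treats character vectors. A small sketch, with made-up variable names and values taken from the question's month_day column:

```r
odd_group  <- c("02-03", "02-08", "02-08")           # 3 values
even_group <- c("02-03", "02-08", "02-08", "02-08")  # 4 values

# Odd length: median() returns the middle element, which is still a string
odd_result <- median(odd_group)

# Even length: median() calls mean() on the two middle strings, which gives
# a numeric NA plus a warning (suppressed here so the script runs quietly)
even_result <- suppressWarnings(median(even_group))

typeof(odd_result)   # character
typeof(even_result)  # double
```

Groups with an odd number of rows therefore return a character and groups with an even number return a double, which is the type mismatch that summarise() reports.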
