Home > Blockchain >  How to Generate a Week-Year Sequence
How to Generate a Week-Year Sequence

Time:05-26

I have a large dataframe broken into the following columns: "country", "year", "week", "conflict_event_count". The data records the country, year, and week in which a conflict event occurred; however, I'd also like to create rows for the weeks in which a conflict event didn't take place and assign a "0" to that week's event count. I.e., make implicit missing country-weeks explicit. This is how my dataframe currently looks:

Country <- c('Afghanistan', 'Afghanistan', 'Afghanistan', 'Algeria', 'Algeria')
Year <- c(1989, 1995, 2019, 1989, 1992)
Week <- c(24, 46, 52, 2, 12)
Conflict_Event_Count <- c(4, 3, 1, 3, 2)
df <- data.frame(Country, Year, Week, Conflict_Event_Count)
df

enter image description here

I'm trying to get the 'year' column to cycle through 1989:2019, and for the week column to cycle through 1:53 for each of the 53 weeks in a year. For example, instead of there being a gap between the third and fourth rows in the above dataframe, it would look something more like this: enter image description here

where the existing event counts remain, but the conflict event counts for the missing country-weeks receive a "0".

I tried to start with the 'week' column and used the complete function to fill in the missing week values

df %>% complete(week = seq(min(week), max(week), by = 'week'))

but got an error message because the 'country' column is non-numeric. Does anyone have any ideas about how to generate this sequence? Thanks in advance!

CodePudding user response:

The Country adds an additional layer of complexity. If you are not going to do this separately for each country then you will need to rep the country by a value that is the product of the year span and 53. And the Year-Week id's would need to be rep-ped by number of unique Country values. (I'm not sure this is the correct way to proced. This may be an X-Y problem: do X to get Y when there might be a more efficient Z method to get to Y. You have not told us what the analysis is going to be.)

For one country: I would create a text (character) field (column) that has the year in YYYY format, a delimiter, and append a week (using the caps in your code block rather than the spelling in pictures):

 new_col <- paste( rep( min(df$Year):max(df$Year), each=53), "_", 1:53)
 df$curr_wks <- paste(df$Year, "_", df$Week)
 expand_df <- merge(df, new_col, all=TRUE)
 is.na(expand_df$confict_event_count) <- 0

CodePudding user response:

library(tidyverse)

df %>%
    group_by(Country) %>%
    complete(Year = 1989:2019, Week = 1:52,
             fill = list(Conflict_Event_Count = 0))

# A tibble: 3,224 x 4
# Groups:   Country [2]
   Country      Year  Week Conflict_Event_Count
   <chr>       <dbl> <dbl>                <dbl>
 1 Afghanistan  1989     1                    0
 2 Afghanistan  1989     2                    0
 3 Afghanistan  1989     3                    0
 4 Afghanistan  1989     4                    0
 5 Afghanistan  1989     5                    0
 6 Afghanistan  1989     6                    0
 7 Afghanistan  1989     7                    0
 8 Afghanistan  1989     8                    0
 9 Afghanistan  1989     9                    0
10 Afghanistan  1989    10                    0

CodePudding user response:

In base R, consider expand.grid for all possible combinations of the three columns. Then, left join merge your original data:

all_country_year_week <- expand.grid(
    Country = unique(original_df$Country),
    Year = 1989:2019,
    Week = 1:53
)

conflict_events_df <- merge(
    all_country_year_week,
    original_df,
    by = c("Country", "Year", "Week")
    all.x = TRUE
) |> transform(
    Conflict_Event_Count = ifelse(
        is.na(Conflict_Event_Count), 0, Conflict_Event_Count
    )
)

CodePudding user response:

If you want to use baseR and seq you could use seq()

df$YWd<-paste(df$Year,df$Week,1,sep="-")
df$date<-as.Date(df$YW,format="%Y-%U-%u") # or other definition of "yearweek"

timeframe<-seq(min(df$date),max(df$date),by="week")
padding<-data.frame(list(date=timeframe)) # create full timeseries
# add Countries
expand.grid(Country=unique(df$Country), date=timeframe)

# merge df’s
padded_df<-merge( 
         padding,
         df[order(df$date),c("date","Country","Conflict_Event_Count")],
         all=T )
# Na to 0
is.na(padded_df$confict_event_count) <- 0

The benefit of this approach is that you have a column of class(Date) which is usefull for timeseries stuff (Arima, decomposition,...)

  • Related