How to Generate a Week-Year Sequence-CodePudding

I have a large dataframe broken into the following columns: "country", "year", "week", "conflict_event_count". The data records the country, year, and week in which a conflict event occurred; however, I'd also like to create rows for the weeks in which a conflict event didn't take place and assign a "0" to that week's event count. I.e., make implicit missing country-weeks explicit. This is how my dataframe currently looks:

Country <- c('Afghanistan', 'Afghanistan', 'Afghanistan', 'Algeria', 'Algeria')
Year <- c(1989, 1995, 2019, 1989, 1992)
Week <- c(24, 46, 52, 2, 12)
Conflict_Event_Count <- c(4, 3, 1, 3, 2)
df <- data.frame(Country, Year, Week, Conflict_Event_Count)
df

I'm trying to get the 'year' column to cycle through 1989:2019, and for the week column to cycle through 1:53 for each of the 53 weeks in a year. For example, instead of there being a gap between the third and fourth rows in the above dataframe, it would look something more like this:

where the existing event counts remain, but the conflict event counts for the missing country-weeks receive a "0".

I tried to start with the 'week' column and used the complete function to fill in the missing week values

df %>% complete(week = seq(min(week), max(week), by = 'week'))

but got an error message because the 'country' column is non-numeric. Does anyone have any ideas about how to generate this sequence? Thanks in advance!

CodePudding user response：

The Country adds an additional layer of complexity. If you are not going to do this separately for each country then you will need to rep the country by a value that is the product of the year span and 53. And the Year-Week id's would need to be rep-ped by number of unique Country values. (I'm not sure this is the correct way to proced. This may be an X-Y problem: do X to get Y when there might be a more efficient Z method to get to Y. You have not told us what the analysis is going to be.)

For one country: I would create a text (character) field (column) that has the year in YYYY format, a delimiter, and append a week (using the caps in your code block rather than the spelling in pictures):

 new_col <- paste( rep( min(df$Year):max(df$Year), each=53), "_", 1:53)
 df$curr_wks <- paste(df$Year, "_", df$Week)
 expand_df <- merge(df, new_col, all=TRUE)
 is.na(expand_df$confict_event_count) <- 0

CodePudding user response：

library(tidyverse)

df %>%
    group_by(Country) %>%
    complete(Year = 1989:2019, Week = 1:52,
             fill = list(Conflict_Event_Count = 0))

# A tibble: 3,224 x 4
# Groups:   Country [2]
   Country      Year  Week Conflict_Event_Count
   <chr>       <dbl> <dbl>                <dbl>
 1 Afghanistan  1989     1                    0
 2 Afghanistan  1989     2                    0
 3 Afghanistan  1989     3                    0
 4 Afghanistan  1989     4                    0
 5 Afghanistan  1989     5                    0
 6 Afghanistan  1989     6                    0
 7 Afghanistan  1989     7                    0
 8 Afghanistan  1989     8                    0
 9 Afghanistan  1989     9                    0
10 Afghanistan  1989    10                    0

CodePudding user response：

In base R, consider expand.grid for all possible combinations of the three columns. Then, left join merge your original data:

all_country_year_week <- expand.grid(
    Country = unique(original_df$Country),
    Year = 1989:2019,
    Week = 1:53
)

conflict_events_df <- merge(
    all_country_year_week,
    original_df,
    by = c("Country", "Year", "Week")
    all.x = TRUE
) |> transform(
    Conflict_Event_Count = ifelse(
        is.na(Conflict_Event_Count), 0, Conflict_Event_Count
    )
)

CodePudding user response：

If you want to use baseR and seq you could use seq()

df$YWd<-paste(df$Year,df$Week,1,sep="-")
df$date<-as.Date(df$YW,format="%Y-%U-%u") # or other definition of "yearweek"

timeframe<-seq(min(df$date),max(df$date),by="week")
padding<-data.frame(list(date=timeframe)) # create full timeseries
# add Countries
expand.grid(Country=unique(df$Country), date=timeframe)

# merge df’s
padded_df<-merge( 
         padding,
         df[order(df$date),c("date","Country","Conflict_Event_Count")],
         all=T )
# Na to 0
is.na(padded_df$confict_event_count) <- 0

The benefit of this approach is that you have a column of class(Date) which is usefull for timeseries stuff (Arima, decomposition,...)