Home > other >  Making every year contain 12 months in a large dataset
Making every year contain 12 months in a large dataset

Time:03-10

I have a large data.frame of variables regarding salmon farming (2005-2020). It contains data from hundreds of different farms (org_anonym) for all 15 years. However, many farms are missing some months or have duplicate months. How can I write this so that every year for every location has 12 months in the order 1-12?

Example:

enter image description here

In this example, farm 126 is missing the 12th month of the year for 2005, whereas 2006 has only the 11th and 12th month. Sometimes the same year has two consecutive rows with the same month.

My desired outcome is to have all locations have years 2005-2020 with months 1-12 without duplicates or missing months (the data in the filled rows can be 0 or NA).

I don't have an intuitive way of doing this since the errors are random.

Please help :)

CodePudding user response:

here is a similar example where I only work with a "6 month year" so it's more readible. It's easier to sort stuff with small examples.

library(plyr) # join and arrange function
# Example where i miss last month of 2005 
# and only have 2 first month for 2006
df <- data.frame( 
  year = c(rep(2005, 5), rep(2006, 2)), 
  month = c(1:5, 4, 5 ),
  value = 126)
# I create a data.frame of all coombination of year and month.
# You can replace year here by 2005:2020 and month by 1:12
tmp <- expand.grid(year = unique(df$year), month = 1:6) 

res <- join(tmp, df, type = "left") # join will add NA for missing values
res <- arrange(res, year, month, value) # arrange will order the rows.
res
#    year month value
# 1  2005     1   126
# 2  2005     2   126
# 3  2005     3   126
# 4  2005     4   126
# 5  2005     5   126
# 6  2005     6    NA
# 7  2006     1    NA
# 8  2006     2    NA
# 9  2006     3    NA
# 10 2006     4   126
# 11 2006     5   126
# 12 2006     6    NA

A shorter way is to use the function complete(), but this require that your dataframe has an occurence of at least each year and month. In my example this won't exactly work since I don't have any year with the "sixth" month.

library(tidyr)
df %>% complete(year, month)

Also complete() is only a wrapper around expand() and join(), so it's better for you to understand what happens during the first solution.

CodePudding user response:

salmon %>% distinct(year) %>%
  mutate(month = list(1:12)) %>% 
  unnest(month) %>% 
  left_join(salmon, by=c("year", "month")

If you want org_acronym to also be carried over, just change above to distinct(year, acronym).

If you want all the years from 2005:2020, just change above to

tibble(year=2005:2020) %>% 
  mutate(month=list(1:12)) %>% 
  unnest(month) %>% 
  left_join(salmon)

Output:

   year month org_acronym
1  2005     1         126
2  2005     2         126
3  2005     3         126
4  2005     4         126
5  2005     5         126
6  2005     6         126
7  2005     7         126
8  2005     8         126
9  2005     9         126
10 2005    10         126
11 2005    11         126
12 2005    12          NA
13 2006     1          NA
14 2006     2          NA
15 2006     3          NA
16 2006     4          NA
17 2006     5          NA
18 2006     6          NA
19 2006     7          NA
20 2006     8          NA
21 2006     9          NA
22 2006    10          NA
23 2006    11         126
24 2006    12         126
25 2007     1         126
26 2007     2          NA
27 2007     3          NA
28 2007     4          NA
29 2007     5          NA
30 2007     6          NA
31 2007     7          NA
32 2007     8          NA
33 2007     9          NA
34 2007    10          NA
35 2007    11          NA
36 2007    12          NA

Input:

salmon  =tibble(
  year = c(rep(2005,11), rep(2006,2),2007),
  month = c(seq(1:11), 11,12,1),
  org_acronym = 126
)
  • Related