I have a large data.frame of variables regarding salmon farming (2005-2020). It contains data from hundreds of different farms (org_anonym) for all 15 years. However, many farms are missing some months or have duplicate months. How can I write this so that every year for every location has 12 months in the order 1-12?
Example:
In this example, farm 126 is missing the 12th month of the year for 2005, whereas 2006 has only the 11th and 12th month. Sometimes the same year has two consecutive rows with the same month.
My desired outcome is to have all locations have years 2005-2020 with months 1-12 without duplicates or missing months (the data in the filled rows can be 0 or NA).
I don't have an intuitive way of doing this since the errors are random.
Please help :)
CodePudding user response:
here is a similar example where I only work with a "6 month year" so it's more readible. It's easier to sort stuff with small examples.
library(plyr) # join and arrange function
# Example where i miss last month of 2005
# and only have 2 first month for 2006
df <- data.frame(
year = c(rep(2005, 5), rep(2006, 2)),
month = c(1:5, 4, 5 ),
value = 126)
# I create a data.frame of all coombination of year and month.
# You can replace year here by 2005:2020 and month by 1:12
tmp <- expand.grid(year = unique(df$year), month = 1:6)
res <- join(tmp, df, type = "left") # join will add NA for missing values
res <- arrange(res, year, month, value) # arrange will order the rows.
res
# year month value
# 1 2005 1 126
# 2 2005 2 126
# 3 2005 3 126
# 4 2005 4 126
# 5 2005 5 126
# 6 2005 6 NA
# 7 2006 1 NA
# 8 2006 2 NA
# 9 2006 3 NA
# 10 2006 4 126
# 11 2006 5 126
# 12 2006 6 NA
A shorter way is to use the function complete()
, but this require that your dataframe has an occurence of at least each year and month. In my example this won't exactly work since I don't have any year with the "sixth" month.
library(tidyr)
df %>% complete(year, month)
Also complete()
is only a wrapper around expand()
and join()
, so it's better for you to understand what happens during the first solution.
CodePudding user response:
salmon %>% distinct(year) %>%
mutate(month = list(1:12)) %>%
unnest(month) %>%
left_join(salmon, by=c("year", "month")
If you want org_acronym
to also be carried over, just change above to distinct(year, acronym)
.
If you want all the years from 2005:2020, just change above to
tibble(year=2005:2020) %>%
mutate(month=list(1:12)) %>%
unnest(month) %>%
left_join(salmon)
Output:
year month org_acronym
1 2005 1 126
2 2005 2 126
3 2005 3 126
4 2005 4 126
5 2005 5 126
6 2005 6 126
7 2005 7 126
8 2005 8 126
9 2005 9 126
10 2005 10 126
11 2005 11 126
12 2005 12 NA
13 2006 1 NA
14 2006 2 NA
15 2006 3 NA
16 2006 4 NA
17 2006 5 NA
18 2006 6 NA
19 2006 7 NA
20 2006 8 NA
21 2006 9 NA
22 2006 10 NA
23 2006 11 126
24 2006 12 126
25 2007 1 126
26 2007 2 NA
27 2007 3 NA
28 2007 4 NA
29 2007 5 NA
30 2007 6 NA
31 2007 7 NA
32 2007 8 NA
33 2007 9 NA
34 2007 10 NA
35 2007 11 NA
36 2007 12 NA
Input:
salmon =tibble(
year = c(rep(2005,11), rep(2006,2),2007),
month = c(seq(1:11), 11,12,1),
org_acronym = 126
)