I am trying to draw a geom_line plot colored by month, with tmean on the y-axis and year on the x-axis. When I try the code on a sample dataset, it produces the intended plot. However, when I run the same code on my own data, it doesn't return what I want.
How can I get the desired plot?
Code output with sample data:
library(tidyverse)
sample_df = data.frame(
month = rep(1:6, each = 30),
year = rep(1980:2009, 6),
so2 = rnorm(180)
)
sample_df %>%
  ggplot(aes(x = year, y = so2, color = as.character(month))) +
  geom_line() +
  ggtitle("Timeseries Plot By Month")
Output:
Code output with original data:
When I create the same plot from the same dataset in other software, I get what I want (the lines represent different plots in the plot below):
Desired output:
Update
Updated sample data
structure(list(tmean = c(31, 30, 32, 33, 34, 40, 41, 42, 43,
46, 33, 36, 39, 32, 38, 42, 44, 48, 52, 56), month = c(1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), year = c(1980,
1980, 1980, 1980, 1980, 1981, 1981, 1981, 1981, 1981, 1980, 1980,
1980, 1980, 1980, 1981, 1981, 1981, 1981, 1981)), row.names = c(NA,
-20L), class = "data.frame")
CodePudding user response:
Your data includes three values in the Variable column. You probably only want the rows where Variable == "tmean"; otherwise the data is all zeroes. After that filter, you still have 58 observations for each year-month, so you need some aggregation.
This might do it:
df %>%
  filter(Variable == "tmean") %>%                # keep only the tmean rows
  group_by(year, month) %>%
  summarize(tmean = mean(tmean)) %>%             # one value per year-month
  ggplot(aes(year, tmean, color = as.factor(month))) +
  geom_line()
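The same aggregation can be tried on the small sample posted in the question's update. This is only a sketch: that sample has no Variable column, so the filter step is left out, and the names sample_df and monthly_means are just illustrative.

```r
library(dplyr)
library(ggplot2)

# Sample data rebuilt from the question's update (no Variable column here)
sample_df <- data.frame(
  tmean = c(31, 30, 32, 33, 34, 40, 41, 42, 43, 46,
            33, 36, 39, 32, 38, 42, 44, 48, 52, 56),
  month = rep(c(1, 2), each = 10),
  year  = rep(rep(c(1980, 1981), each = 5), times = 2)
)

# Collapse the five readings per year-month into a single mean
monthly_means <- sample_df %>%
  group_by(year, month) %>%
  summarize(tmean = mean(tmean), .groups = "drop")

# One line per month, one point per year
p <- ggplot(monthly_means, aes(year, tmean, color = as.factor(month))) +
  geom_line()
print(p)
```

Without the summarize step, geom_line would connect all five readings that share the same year, which is what produces the vertical zig-zags in the unaggregated plot.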
How might you figure this out next time?
It's useful in data exploration, especially when you see something unusual, to investigate the data to see if it has the characteristics you should expect.
Your data has about 88,000 rows. There are 42 different year values and 13 (?) different month values.
nrow(df)
[1] 87870
df %>%
summarize(across(c(year, month, Variable, tmean), n_distinct))
# A tibble: 1 × 4
year month Variable tmean
<int> <int> <int> <int>
1 42 13 3 11943
If you had one value for each year-month, you'd expect around 546 observations (42 × 13). But you have about 88,000, so there are about 160 observations for each year-month, across 3 Variables. We can see that only the rows with Variable == "tmean" have any variation:
df %>%
group_by(Variable) %>%
summarize(n_values = n_distinct(tmean))
# A tibble: 3 × 2
Variable n_values
<chr> <int>
1 tmax 1
2 tmean 11942
3 tmin 1
We can also see that each year-month has 58 observations, and this is the case for all 505 year-months in the data.
df %>%
filter(Variable == "tmean") %>%
count(year, month, name = "yearmo_obs") %>%
add_count(yearmo_obs, name = "times_this_count")
# A tibble: 505 × 4
year month yearmo_obs times_this_count
<dbl> <dbl> <int> <int>
1 1980 0 58 505
2 1980 1 58 505
3 1980 2 58 505
4 1980 3 58 505
5 1980 4 58 505
6 1980 5 58 505
7 1980 6 58 505
8 1980 7 58 505
9 1980 8 58 505
10 1980 9 58 505
# … with 495 more rows
You can observe the same sort of issue if you sort the data. We'd see here that there are multiple tmean values for points that have the same coordinates in the eventual plot:
df %>%
  arrange(year, month)
# A tibble: 87,870 × 7
...1 tmean Variable year month TMin TMax
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 3 24.4 tmean 1980 0 0 0
2 510 24.5 tmean 1980 0 0 0
3 511 24.5 tmean 1980 0 0 0
4 1518 24.6 tmean 1980 0 0 0
5 2027 24.2 tmean 1980 0 0 0
6 2028 24.3 tmean 1980 0 0 0
7 2029 24.2 tmean 1980 0 0 0
8 3544 24.3 tmean 1980 0 0 0
9 3545 24.3 tmean 1980 0 0 0
10 3546 24.3 tmean 1980 0 0 0
# … with 87,860 more rows
Those observations tell us we should filter by Variable and do some sort of aggregation if we want to show a single value per year-month with geom_line.
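As a quick pre-plot check, something like the sketch below can flag year-month combinations that occur more than once. The helper name find_duplicate_keys and the tiny raw data frame are made up for illustration; only count() and filter() from dplyr are assumed.

```r
library(dplyr)

# Hypothetical helper: return the (year, month) keys that appear more than once
find_duplicate_keys <- function(data) {
  data %>%
    count(year, month, name = "n_obs") %>%
    filter(n_obs > 1)
}

# Toy data: 1980-01 appears twice, 1981-01 once
raw <- data.frame(
  year  = c(1980, 1980, 1981),
  month = c(1, 1, 1),
  tmean = c(31, 30, 40)
)

dups <- find_duplicate_keys(raw)
print(dups)
```

If the result has any rows, geom_line will be drawing several y values at the same x position for a given color, so the data needs filtering or aggregating first.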
CodePudding user response:
Thank you for the useful comments and the first answer. I also found another way of doing it, so I thought I'd add it.
aggregate(tmean ~ month + year,   # Aggregate data
          df,
          FUN = mean) %>%
  ggplot(aes(x = year, y = tmean, color = as.factor(month))) +
  geom_line() +
  ggtitle("Timeseries Plot By Month")
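A further option, sketched here rather than taken from the answers above, is to let ggplot2 do the averaging itself with stat_summary(). This assumes ggplot2 3.3.0 or later (where the argument is called fun) and uses the small sample from the question's update, so there is no Variable column to filter on.

```r
library(ggplot2)

# Sample data rebuilt from the question's update
sample_df <- data.frame(
  tmean = c(31, 30, 32, 33, 34, 40, 41, 42, 43, 46,
            33, 36, 39, 32, 38, 42, 44, 48, 52, 56),
  month = rep(c(1, 2), each = 10),
  year  = rep(rep(c(1980, 1981), each = 5), times = 2)
)

# stat_summary() averages tmean at each year within each month group,
# then draws the result as a line
p <- ggplot(sample_df, aes(x = year, y = tmean, color = as.factor(month))) +
  stat_summary(fun = mean, geom = "line") +
  ggtitle("Timeseries Plot By Month")
print(p)
```

This keeps the raw data frame untouched, at the cost of hiding the aggregation inside the plotting call.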