R ggplot2 geom_line by year

I am trying to draw a geom_line plot colored by month, with tmean on the y-axis and year on the x-axis. When I run the code on a sample dataset it produces the intended plot. However, when I run the same code on my own data, it doesn't return what I want.

How can I get the desired plot?

Code output with sample data:

library(tidyverse)

sample_df = data.frame(
    month = rep(1:6, each = 30),
    year = rep(1980:2009, 6),
    so2 = rnorm(180)
)

sample_df %>%
  ggplot(aes(x = year, y = so2, color = as.character(month))) +
  geom_line() +
  ggtitle("Timeseries Plot By Month")

Output:

Code output with original data:

The data can be accessed

When I create the desired plot from the same dataset in another program, I get what I want (lines represent different months in the plot below):

Desired output:

Update

Updated sample data

structure(list(tmean = c(31, 30, 32, 33, 34, 40, 41, 42, 43, 
46, 33, 36, 39, 32, 38, 42, 44, 48, 52, 56), month = c(1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), year = c(1980, 
1980, 1980, 1980, 1980, 1981, 1981, 1981, 1981, 1981, 1980, 1980, 
1980, 1980, 1980, 1981, 1981, 1981, 1981, 1981)), row.names = c(NA, 
-20L), class = "data.frame")

User response:

Your data has three distinct values in the Variable column. You probably want only the rows where Variable == "tmean"; in the other rows, tmean is all zeroes.
After that filter, you have 58 observations for each year-month, so you need some aggregation.

This might do it:

df %>%
  filter(Variable == "tmean") %>%
  group_by(year, month) %>%
  summarize(tmean = mean(tmean)) %>%
  ggplot(aes(year, tmean, color = as.factor(month))) +
  geom_line()
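As a minimal sketch, the same aggregation can be tried on the updated sample data from the question. That sample has no Variable column, so the filter step is left out here. The names sample2, monthly and p are illustrative only and do not come from the original data.

```r
library(dplyr)
library(ggplot2)

# Updated sample data from the question: 5 tmean readings per year-month
sample2 <- data.frame(
  tmean = c(31, 30, 32, 33, 34, 40, 41, 42, 43, 46,
            33, 36, 39, 32, 38, 42, 44, 48, 52, 56),
  month = rep(c(1, 2), each = 10),
  year  = rep(rep(c(1980, 1981), each = 5), times = 2)
)

# Collapse to a single mean value per year-month
monthly <- sample2 %>%
  group_by(year, month) %>%
  summarize(tmean = mean(tmean), .groups = "drop")

# One line per month, one point per year
p <- ggplot(monthly, aes(year, tmean, color = as.factor(month))) +
  geom_line()
print(p)
```

Without the summarize step, geom_line would connect all five readings that share the same year, which shows up as vertical zigzags rather than one clean line per month.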

How might you figure this out next time?

It's useful in data exploration, especially when you see something unusual, to investigate the data to see if it has the characteristics you should expect.

Your data has about 88,000 rows. There are 42 distinct year values and 13 (?) distinct month values.

nrow(df)
[1] 87870

df %>% 
  summarize(across(c(year, month, Variable, tmean), n_distinct))
# A tibble: 1 × 4
   year month Variable tmean
  <int> <int>    <int> <int>
1    42    13        3 11943

If you had one value for each year-month, you'd expect around 546 observations (42 years × 13 months). But you have about 88,000, so there are about 160 observations for each year-month, across 3 Variables. We can see that only the rows with Variable == "tmean" have any variation:

df %>%
  group_by(Variable) %>%
  summarize(n_values = n_distinct(tmean))
# A tibble: 3 × 2
  Variable n_values
  <chr>       <int>
1 tmax            1
2 tmean       11942
3 tmin            1

We can also see that each year-month has 58 observations, and this is the case for all 505 year-months in the data.

df %>% 
  filter(Variable == "tmean") %>%
  count(year, month, name = "yearmo_obs") %>%
  add_count(yearmo_obs, name = "times_this_count")
# A tibble: 505 × 4
    year month yearmo_obs times_this_count
   <dbl> <dbl>      <int>            <int>
 1  1980     0         58              505
 2  1980     1         58              505
 3  1980     2         58              505
 4  1980     3         58              505
 5  1980     4         58              505
 6  1980     5         58              505
 7  1980     6         58              505
 8  1980     7         58              505
 9  1980     8         58              505
10  1980     9         58              505
# … with 495 more rows
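The counts reported so far can be cross-checked with plain arithmetic. The sketch below only reuses numbers printed above; the variable names are made up for illustration, and the idea that each of the 3 Variable values contributes the same number of rows is an assumption that the matching total supports but does not prove.

```r
# Figures taken from the outputs above
n_rows      <- 87870  # nrow(df)
n_years     <- 42     # distinct year values
n_months    <- 13     # distinct month values
n_variables <- 3      # distinct Variable values
n_yearmo    <- 505    # year-months actually present after filtering
obs_per_ym  <- 58     # tmean rows per year-month

# Upper bound on year-month combinations, and rows per combination
max_cells     <- n_years * n_months   # 546
rows_per_cell <- n_rows / max_cells   # about 160.9

# Assumption: every Variable has obs_per_ym rows in every year-month
implied_rows <- obs_per_ym * n_yearmo * n_variables

max_cells
rows_per_cell
implied_rows == n_rows
```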

You can observe the same sort of issues if you sort the data. We'd see here that there are multiple tmean values for points that have the same coordinates in the eventual plot:

df %>%
  arrange(year, month)
# A tibble: 87,870 × 7
    ...1 tmean Variable  year month  TMin  TMax
   <dbl> <dbl> <chr>    <dbl> <dbl> <dbl> <dbl>
 1     3  24.4 tmean     1980     0     0     0
 2   510  24.5 tmean     1980     0     0     0
 3   511  24.5 tmean     1980     0     0     0
 4  1518  24.6 tmean     1980     0     0     0
 5  2027  24.2 tmean     1980     0     0     0
 6  2028  24.3 tmean     1980     0     0     0
 7  2029  24.2 tmean     1980     0     0     0
 8  3544  24.3 tmean     1980     0     0     0
 9  3545  24.3 tmean     1980     0     0     0
10  3546  24.3 tmean     1980     0     0     0
# … with 87,860 more rows

Those observations tell us we should filter by Variable and do some sort of aggregation if we want to show a single value per year-month with geom_line.

User response:

Thank you for the useful comments and the first answer. I also found another way of doing it, so I thought I would add it.

aggregate(tmean ~ month + year,   # Aggregate data: mean tmean per month and year
          df,
          FUN = mean) %>%
  ggplot(aes(x = year, y = tmean, color = as.factor(month))) +
  geom_line() +
  ggtitle("Timeseries Plot By Month")
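A third option, not mentioned in either answer, is to let ggplot2 do the averaging with stat_summary. This is only a sketch on the updated sample data from the question: it assumes ggplot2 3.3.0 or later (for the fun argument), and sample2 and p are illustrative names. On the full data you would presumably still need to filter to Variable == "tmean" first.

```r
library(ggplot2)

# Updated sample data from the question: 5 tmean readings per year-month
sample2 <- data.frame(
  tmean = c(31, 30, 32, 33, 34, 40, 41, 42, 43, 46,
            33, 36, 39, 32, 38, 42, 44, 48, 52, 56),
  month = rep(c(1, 2), each = 10),
  year  = rep(rep(c(1980, 1981), each = 5), times = 2)
)

# stat_summary computes mean(tmean) for each year within each month group
# and draws the result as a line, so no separate aggregation step is needed
p <- ggplot(sample2, aes(x = year, y = tmean, color = as.factor(month))) +
  stat_summary(fun = mean, geom = "line") +
  ggtitle("Timeseries Plot By Month")
print(p)
```

The trade-off is that the aggregated values live only inside the plot. If you also want the monthly means as a table, the group_by/summarize or aggregate approaches are a better fit.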