Line graph of time and income (categorical) count

I'm trying to create a line plot with Publication Date on the x axis and the count of my categorical variable, Income, on the y axis. My data currently looks something like this:

ID       Pubdate              Country      Income
1        2020-01-01           USA          High
2        2021-01-01           Canada       High
3        2022-01-01           Bolivia      Low-mid

I am hoping to have multiple colored lines, one per Income level, with time on the x axis and the frequency of each Income level (e.g. high = 30,000, low-mid = 10,000, low = 1,000) on the y axis. I've been trying different code, such as:

plot1 <- ggplot(lmic.plot, aes(x=Pubdate, y=Income)) +
         geom_line()

I'm thinking I need to somehow turn the Income variable into a count, but I keep getting error messages when I try.

CodePudding user response:

You can use:

library(readr)
library(dplyr)
read_table('ID       Pubdate              Country      Income
            1        2020-01-01           USA          High
            2        2021-01-01           Canada       High
            3        2022-01-01           Bolivia      Low-mid') -> df

df %>% 
  arrange(Pubdate) %>% 
  group_by(Income) %>% 
  mutate(cumulative_count = row_number())

# A tibble: 3 × 5
# Groups:   Income [2]
     ID Pubdate    Country Income  cumulative_count
  <dbl> <date>     <chr>   <chr>              <int>
1     1 2020-01-01 USA     High                   1
2     2 2021-01-01 Canada  High                   2
3     3 2022-01-01 Bolivia Low-mid                1

This gives, for each row, how many records of that Income level have appeared up to and including that Pubdate. You can then plot cumulative_count against Pubdate, with one line per Income level.
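The plotting step could look something like the sketch below. The toy data frame is made up for illustration (it only mirrors the Pubdate and Income columns from the question), and the names `cumulative` and `p` are mine, not from the original post.

```r
library(dplyr)
library(ggplot2)

# Toy data with the same column names as the question (values are invented)
df <- data.frame(
  Pubdate = as.Date(c("2020-01-01", "2021-01-01", "2022-01-01", "2022-06-01")),
  Income  = c("High", "High", "Low-mid", "Low-mid")
)

# Running count of records within each Income level, ordered by date
cumulative <- df %>%
  arrange(Pubdate) %>%
  group_by(Income) %>%
  mutate(cumulative_count = row_number()) %>%
  ungroup()

# One coloured line per Income level; points keep single-date levels visible
p <- ggplot(cumulative, aes(x = Pubdate, y = cumulative_count, colour = Income)) +
  geom_line() +
  geom_point()

print(p)
```

Mapping Income to colour is what splits the data into separate lines; without it, geom_line() would connect all rows into a single path.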

CodePudding user response:

Not entirely sure whether this is best practice (probably not), but a pipe consisting of group_by, mutate, select and distinct seems to do the job here.

library(dplyr)
library(ggplot2)

data <- data.frame(ID = c(1,2,3),
                   Pubdate = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02") |> as.Date(),
                   Country = c("USA", "Canada", "Bolivia", "USA", "Canada", "Bolivia"),
                   Income = c("High", "High", "Low-mid", "High", "High", "High"))

# Count rows per Pubdate/Income combination, then keep one row per combination
data <- data |> 
  group_by(Pubdate, Income) |>
  mutate(count = n()) |> 
  select(Pubdate, Income, count) |> 
  distinct() |>
  ungroup()

g1 <- ggplot(data, aes(x = Pubdate, y = count, colour = Income)) +
  geom_line()
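For what it's worth, the same per-date counts can probably be had more compactly with dplyr's count(), which groups, tallies and ungroups in one call. This is only a sketch on the same toy data as above; the names `counts` and `g2` are mine.

```r
library(dplyr)
library(ggplot2)

# Same toy data as in the answer above
data <- data.frame(
  Pubdate = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                      "2020-01-02", "2020-01-02", "2020-01-02")),
  Income  = c("High", "High", "Low-mid", "High", "High", "High")
)

# One row per Pubdate/Income combination, with the tally stored in `count`
counts <- data |>
  count(Pubdate, Income, name = "count")

# geom_point() is added because a level seen on only one date has no segment to draw
g2 <- ggplot(counts, aes(x = Pubdate, y = count, colour = Income)) +
  geom_line() +
  geom_point()

print(g2)
```

Note that combinations which never occur (here, Low-mid on 2020-01-02) are simply absent from `counts` rather than recorded as zero, so that line will have a gap instead of dropping to 0.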