I'm trying to create a line plot with Publication Date on the x axis and the count of my categorical variable, Income, on the y axis. My data currently looks something like this:
ID Pubdate Country Income
1 2020-01-01 USA High
2 2021-01-01 Canada High
3 2022-01-01 Bolivia Low-mid
I am hoping to have multiple colored lines, one per Income level, with time on the x axis and the frequency of each Income level (e.g. high = 30,000, low-mid = 10,000, low = 1,000) on the y axis. I've been trying different code, such as
plot1 <- ggplot(lmic.plot, aes(x=Pubdate, y=Income))
geom_line()
I'm thinking I need to somehow turn the Income variable into a count, but I keep getting error messages when I try.
CodePudding user response:
You can use:
library(readr)
library(dplyr)
read_table('ID Pubdate Country Income
1 2020-01-01 USA High
2 2021-01-01 Canada High
3 2022-01-01 Bolivia Low-mid') -> df
df %>%
  arrange(Pubdate) %>%
  group_by(Income) %>%
  mutate(cumulative_count = row_number())
# A tibble: 3 × 5
# Groups:   Income [2]
     ID Pubdate    Country Income  cumulative_count
  <dbl> <date>     <chr>   <chr>              <int>
1     1 2020-01-01 USA     High                   1
2     2 2021-01-01 Canada  High                   2
3     3 2022-01-01 Bolivia Low-mid                1
This gives a running count of how many rows of each Income level you have up to a given Pubdate. You can then plot cumulative_count against Pubdate, with one line per Income level.
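The plotting step is not shown above, so here is a minimal sketch of one way to do it. It assumes ggplot2 is available and rebuilds the three sample rows with data.frame() so that it runs on its own. The names cumulative and p are placeholders, and geom_point() is an addition so that an Income level with only one row (Low-mid here) still shows up.

```r
library(dplyr)
library(ggplot2)

# Sample rows from the question, rebuilt here so the sketch is self-contained
df <- data.frame(
  ID      = 1:3,
  Pubdate = as.Date(c("2020-01-01", "2021-01-01", "2022-01-01")),
  Country = c("USA", "Canada", "Bolivia"),
  Income  = c("High", "High", "Low-mid")
)

# Running count of each Income level, ordered by publication date
cumulative <- df %>%
  arrange(Pubdate) %>%
  group_by(Income) %>%
  mutate(cumulative_count = row_number()) %>%
  ungroup()

# One coloured line per Income level; points keep single-row levels visible
p <- ggplot(cumulative, aes(x = Pubdate, y = cumulative_count, colour = Income)) +
  geom_line() +
  geom_point()
p
```

With real data containing thousands of rows per level, each line would climb steadily over time rather than consist of one or two points.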
CodePudding user response:
I'm not sure this is best practice (probably not), but a pipe consisting of group_by, mutate, select and distinct seems to do the job here.
library(dplyr)
library(ggplot2)
# Six example rows: three publications on each of two dates
data <- data.frame(ID = 1:6,
                   Pubdate = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02") |> as.Date(),
                   Country = c("USA", "Canada", "Bolivia", "USA", "Canada", "Bolivia"),
                   Income = c("High", "High", "Low-mid", "High", "High", "High"))
# Count rows per Pubdate/Income pair, then keep one row per pair
data <- data |>
  group_by(Pubdate, Income) |>
  mutate(count = n()) |>
  select(Pubdate, Income, count) |>
  distinct() |>
  ungroup()
# The + is required to add the geom_line() layer to the plot
g1 <- ggplot(data, aes(x = Pubdate, y = count, colour = Income)) +
  geom_line()
g1
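As a possibly tidier alternative to the group_by, mutate, select and distinct pipe, dplyr's count() collapses the data to one row per Pubdate and Income combination in a single step. The sketch below assumes the same example values as above; the names counts and g2 are placeholders, and geom_point() is added only so that a level appearing on a single date remains visible.

```r
library(dplyr)
library(ggplot2)

# Same example values as above, limited to the two columns the plot needs
data <- data.frame(
  Pubdate = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                      "2020-01-02", "2020-01-02", "2020-01-02")),
  Income  = c("High", "High", "Low-mid", "High", "High", "High")
)

# count() groups by the given columns and returns one row per combination
counts <- data |>
  count(Pubdate, Income, name = "count")

# One coloured line per Income level
g2 <- ggplot(counts, aes(x = Pubdate, y = count, colour = Income)) +
  geom_line() +
  geom_point()
g2
```

Because count() returns an ungrouped summary, there is no need for select() or distinct() afterwards.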