trying to make a ggplot with two lines-CodePudding

data can be found at: https://www.kaggle.com/tovarischsukhov/southparklines

SP = read.csv("/Users/michael/Desktop/stat 479 proj data/All-seasons.csv")
SP$Season = as.numeric(SP$Season)
SP$Episode = as.numeric(SP$Episode)

Clean.Boys = SP  %>% select(Season, Episode, Character) %>% 
  arrange(Season, Episode, Character) %>% 
  filter(Character == "Kenny"   | Character == "Cartman") %>% 
  group_by(Season, Episode) 

count = table(Clean.Boys)
count = as.data.frame(count)
Clean = count %>% pivot_wider(names_from = Character, values_from = Freq) %>% group_by(Episode)

Season Episode Cartman Kenny
  <fct>  <fct>     <int> <int>
1 1      1            85     5
2 2      1             1     0
3 3      1            43    19
4 4      1            83     6
5 5      1            37     3
6 6      1            67     0

I am trying to use ggplot to make a single plot with 2 lines on it one for the Cartman variable and one for the Kenny variable. My two questions are

is my data formated correctly to make a plot with geom_line()? or would I have to Pivot it longer?
I want to plot the X-scale as a continuous variable, similar to date but instead, it is season and episode. For example the first plotting point would be Season 1 Episode 1 then Season 1 Episode 2 and so on. I am stuck on how I would be able to do that with season and Episode being in separate columns and even if I combined them I'm not sure what the proper format would be.

CodePudding user response：

The trick is to gather the columns you want to map as variables. As I don't know, how you want to plot your graph, means, about x-axis and y-axis, I made a pseudo plot. and for your continuous variable part, you can either convert your values to integer or numeric using as.integer() or as.numeric(), then you can use as continuous scale. You can check your variable structure by calling str(df), which will show you the class of your variable, if it is in factor or character, convert them to numbers.

#libraries
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.0.5
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
#> Warning: package 'tidyr' was built under R version 4.0.3

#your code
SP <- read.csv("C:/Users/saura/Desktop/All-seasons.csv")
SP$Season = as.numeric(SP$Season)
#> Warning: NAs introduced by coercion
SP$Episode = as.numeric(SP$Episode)
#> Warning: NAs introduced by coercion

Clean.Boys = SP  %>% select(Season, Episode, Character) %>% 
  arrange(Season, Episode, Character) %>% 
  filter(Character == "Kenny"   | Character == "Cartman") %>% 
  group_by(Season, Episode) 
count = table(Clean.Boys)
count = as.data.frame(count)
Clean = count %>% pivot_wider(names_from = Character, values_from = Freq) %>% group_by(Episode)

#here is your code, but as I dont know, what you want on your axis
new_df <- Clean %>%
  gather(-Season,-Episode, key = "Views", value = "numbers")

ggplot(data = new_df, aes(
  as.numeric(Episode),
  numbers,
  color = Views,
  group = Views
))  
  geom_path()

^{Created on 2022-02-19 by the reprex package (v2.0.1)}

CodePudding user response：

In this example I've used readr::read_csv to read the file and set the variable types in the call to save doing this in separate lines of code.

The frequency count can be done with dplyr::summarise, within the piped workflow.

I'm not sure what you really mean by wanting to keep the season and episode data as a continuous variable - you'd have to be more explicit about how you want this to look. The approach I've taken is to provide a means of showing season and episode using minimal text: The order of season and episode are in numeric order by default, but when combined into a character they have to be coerced into numerical order by using factor. An alternative could be to facet by season.

ggplot likes to have data in long format, so there is no need to convert the data into wide format.

To keep the graph readable only the first 80 observations are shown.

library(readr)
library(dplyr)
library(ggplot2

SP <- read_csv("...your file path.../All-seasons.csv"col_types = "nncc")

Clean.Boys <- 
  SP  %>% 
  select(-Line) %>% 
  arrange(Season, Episode, Character) %>% 
  filter(Character == "Kenny"  | Character == "Cartman") %>% 
  group_by(Season, Episode, Character)%>% 
  summarise(count = n(), .groups = "keep") %>%
  mutate(x_lab = factor(paste(Season, Episode, sep = "\n"))) %>% 
  head(n = 80)

ggplot(Clean.Boys) 
  geom_line(aes(x_lab, count, group = Character, colour = Character)) 
  labs(x = "Season and episode")

^{Created on 2022-02-20 by the reprex package (v2.0.1)}