I have a dataset with information on where individuals work over time, where time is defined as year/month (and shown as numeric values YYYYMM in my dataset). I use ggplot to visualise how long individuals stay in a given workplace as well as how they move around. I used position_dodge to make it visible when the same individual works in more than one place during the same month.
In the simple example below:
- individual A works in place 1 from Jan/2012 (i.e., 201201) until Dec/2012
- individual B works in place 1 from Jan/2012 until Jun/2012 and then switches to place 2 from Jul/2012 until Nov/2012
- individual C works in place 1 from Jan/2012 until Apr/2012 and in place 2 from Feb/2012 until Jun/2012
- individual D works in place 1 only during Jan/2012
My query is related to how to use time intervals. In my dataset, the time period variable refers to the entire month. For instance, individual A actually works in workplace 1 from 01/01/2012 until 31/12/2012 and individual D works in workplace 1 from 01/01/2012 until 31/01/2012.
# individual A
a_id <- c(rep('A',12))
a_period <- c(seq(201201, 201212))
a_workplace <-c(rep(1,12))
# individual B
b_id <- c(rep('B',11))
b_period <- c(seq(201201,201206), seq(201207,201211))
b_workplace <-c(rep(1,6), rep(2,5))
# individual C
c_id <- c(rep('C',9))
c_period <- c(seq(201201,201204), seq(201202,201206))
c_workplace <-c(rep(1,4), rep(2,5))
# individual D
d_id <- c(rep('D',1))
d_period <- c(seq(201201,201201))
d_workplace <-c(rep(1,1))
# final data frame
id <- c(a_id, b_id, c_id, d_id)
period <- c(a_period, b_period, c_period, d_period)
workplace <- as.factor(c(a_workplace, b_workplace, c_workplace, d_workplace))
mydata <- data.frame(id, period, workplace)
library(ggplot2)
ggplot(mydata, aes(x = id, y = period, color = workplace)) +
  geom_line(position = position_dodge(width = 0.1), size = 2) +
  scale_x_discrete(limits = rev) +
  scale_y_continuous(breaks = seq(201201, 201212, by = 1)) +
  coord_flip() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = c(.8, .2),
        legend.direction = "vertical",
        legend.background = element_rect(linetype = "solid", colour = "black"),
        panel.background = element_rect(fill = "grey97")) +
  labs(y = "time", title = "Work affiliation")
The ggplot above treats each year/month as a single point in time. For instance, it shows no working history for individual D. How can I make each consecutive sequence at the individual-workplace level begin on the first day of its first month and end on the last day of its last month? I would also like to convert the year/month variable from numeric to date format to make manipulation easier.
PS: I highlight each consecutive sequence in the paragraph above because the same individual may work in a given place for a few months, leave for a time period and then return to work in this same workplace later on. In these cases, the two time intervals in which the individual works in this workplace should be considered separately.
CodePudding user response:
Here is a possible solution.
Logic:
Data tweaking:
- split your integer period into year and month
- add a '01' as a day column and make_date of these columns (year, month, day)
- group
- summarise
- create a new column with unite to combine id and workplace
Plotting:
- Use geom_segment with start and end point. The known drawback is that there is no jittering in geom_segment, therefore we created the id_workplace column before.
- Using date_breaks we choose month and label as month and year.
Note: D has start and end in the same month, therefore no line.
library(tidyverse)
library(lubridate) # make_date function
library(scales) # date_breaks function
mydata %>%
  separate(period, into = c("year", "month"), sep = -2) %>%
  mutate(day = '01',
         period = make_date(year, month, day)) %>%
  group_by(id, workplace) %>%
  summarise(start = first(period), end = last(period), .groups = "drop") %>%
  unite(id_workplace, c(id, workplace), remove = FALSE) %>%
  ggplot() +
  geom_segment(aes(x = start, y = id_workplace, xend = end, yend = id_workplace,
                   col = workplace), size = 2, position = position_dodge(width = 0.1)) +
  scale_x_date(breaks = date_breaks("month"), date_labels = "%b-%Y") +
  xlab('Time') +
  theme(axis.text.x = element_text(vjust = 0.5))
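The pipeline above groups by id and workplace only, so a return to the same workplace after a gap would be merged into one segment, and each segment ends on the first day of its last month. A minimal sketch of one way to get separate consecutive sequences that run from the first day of the first month to the last day of the last month is shown here. It assumes dplyr and lubridate are available; the toy data, the hypothetical individual E, and the column names month_start, month_index and spell are illustrative and not part of the original data.

```r
library(dplyr)
library(lubridate)

# Toy data (assumption): D works one month; hypothetical E leaves
# workplace 1 after Mar/2012 and returns in Jun/2012
toy <- data.frame(
  id        = c("D", rep("E", 5)),
  period    = c(201201, 201201, 201202, 201203, 201206, 201207),
  workplace = factor(c(1, 1, 1, 1, 1, 1))
)

spells <- toy %>%
  # YYYYMM -> first day of that month
  mutate(month_start = ymd(paste0(period, "01"))) %>%
  arrange(id, workplace, month_start) %>%
  group_by(id, workplace) %>%
  # month_index rises by exactly 1 between consecutive months,
  # so any other step starts a new spell
  mutate(month_index = year(month_start) * 12 + month(month_start),
         spell = cumsum(c(TRUE, diff(month_index) != 1))) %>%
  group_by(id, workplace, spell) %>%
  # end = last day of the last month in the spell
  summarise(start = min(month_start),
            end   = max(month_start) + months(1) - days(1),
            .groups = "drop")

print(spells)
```

With start and end defined this way, a one-month spell such as D's spans 01/01/2012 to 31/01/2012, so passing these columns to geom_segment should draw a visible line for it.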
CodePudding user response:
For the second question, regarding converting the numeric variable into a date type, I have an answer:
library(lubridate) # handling and conversion of date types
# lubridate::ymd() turns a complete year-month-day value into a date
# as.Date() turns a character string into the Date type, which is, by the way,
# the proper way to hand time-related data to ggplot
That should do it for your code (period has no day component, so append "01" before parsing):
mydata$period <- lubridate::ymd(paste0(mydata$period, "01"))
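For readers who prefer not to load lubridate, the same conversion can be sketched in base R with as.Date() and an explicit format. This is an illustration under the assumption that every value is a valid six-digit YYYYMM number; the vector below is made up for the example.

```r
# YYYYMM stored as numbers, as in the question (example values)
period <- c(201201, 201202, 201212)

# Append a day so the string is a complete date, then parse it
# with an explicit format instead of relying on guessing
month_start <- as.Date(paste0(period, "01"), format = "%Y%m%d")

# Locale-independent year-month labels derived from the dates
labels <- format(month_start, "%Y-%m")

print(month_start)
print(labels)
```

Giving the format explicitly avoids any ambiguity about how a six-digit value without a day should be read.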