I have a df similar to the following:
df <- data.frame(name = c("billy","billy","sarah","sarah","sarah","sarah","sarah","sarah","sarah","linda","linda","linda","linda","linda","linda"),
data = c("soccer","8-01-1992","basketball","soccer","10-19-1995","10-25-1995","basketball","11-24-1995",6,"hockey","soccer","basketball","12-21-2001","12-30-2001","1-19-2002"),
event = c("joins.project","joins.when","joins.project","joins.project1","joins.when","joins.when1","participation.project","participation.when","participation.repetitions","joins.project","joins.project1","joins.project2","joins.when","joins.when1","joins.when2"))
> df
name data event
1 billy soccer joins.project
2 billy 8-01-1992 joins.when
3 sarah basketball joins.project
4 sarah soccer joins.project1
5 sarah 10-19-1995 joins.when
6 sarah 10-25-1995 joins.when1
7 sarah basketball participation.project
8 sarah 11-24-1995 participation.when
9 sarah 6 participation.repetitions
10 linda hockey joins.project
11 linda soccer joins.project1
12 linda basketball joins.project2
13 linda 12-21-2001 joins.when
14 linda 12-30-2001 joins.when1
15 linda 1-19-2002 joins.when2
My desired output is below.
new_df <- data.frame(name = c("billy", "sarah", "sarah", "sarah", "linda", "linda", "linda"),
join_or_particip = c("join", "join", "join", "participate", "join", "join", "join"),
sport = c("soccer", "basketball", "soccer", "basketball", "hockey", "soccer", "basketball"),
when = c("8-01-1992", "10-19-1995", "10-25-1995", "11-24-1995", "12-21-2001", "12-30-2001", "1-19-2002"),
repetitions = c(NA, NA, NA, 6, NA, NA, NA))
new_df
name join_or_particip sport when repetitions
1 billy join soccer 8-01-1992 NA
2 sarah join basketball 10-19-1995 NA
3 sarah join soccer 10-25-1995 NA
4 sarah participate basketball 11-24-1995 6
5 linda join hockey 12-21-2001 NA
6 linda join soccer 12-30-2001 NA
7 linda join basketball 1-19-2002 NA
I imagine I need to use spread()
to achieve this but I'm not sure how to account for the fact that certain rows in df
are linked (i.e., "joins.project1" and "joins.when1" go together). A bit of further context: this is a dataset of people who "joined" or "participated" in between 1 and 200 "sports" on different dates. If they "participated" there is an additional variable called "repetitions" (i.e., how many times they participated).
CodePudding user response:
The person who designed the structure of that data has no business working with data.
library(data.table)
setDT(df)
df[, c("action", "type") := tstrsplit(event, "\\.")]
df[, c("type", "ID") := tstrsplit(type, "(?<=[A-Za-z])(?=[0-9])", perl = TRUE)]
df[is.na(ID), ID := 0]
res <- df[type == "project"]
#data.table joins:
res[df[type == "when"], when := i.data, on = .(name, action, ID)]
res[df[type == "repetitions"], repetitions := i.data, on = .(name, action, ID)]
res
# name data event action type ID when repetitions
# 1: billy soccer joins.project joins project 0 8-01-1992 <NA>
# 2: sarah basketball joins.project joins project 0 10-19-1995 <NA>
# 3: sarah soccer joins.project1 joins project 1 10-25-1995 <NA>
# 4: sarah basketball participation.project participation project 0 11-24-1995 6
# 5: linda hockey joins.project joins project 0 12-21-2001 <NA>
# 6: linda soccer joins.project1 joins project 1 12-30-2001 <NA>
# 7: linda basketball joins.project2 joins project 2 1-19-2002 <NA>
CodePudding user response:
dplyr
way
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate(event, into = c("join_or_particip", "type"), sep = "\\.") %>%
mutate(num = as.numeric(gsub('[^0-9.-]', '', type)) %>% replace_na(., 0),
type = gsub('[0-9] ', '', type)) %>%
pivot_wider(id_cols = c(name, join_or_particip, num), names_from = type, values_from = data) %>%
select(-num)
name join_or_particip project when repetitions
<chr> <chr> <chr> <chr> <chr>
1 billy joins soccer 8-01-1992 NA
2 sarah joins basketball 10-19-1995 NA
3 sarah joins soccer 10-25-1995 NA
4 sarah participation basketball 11-24-1995 6
5 linda joins hockey 12-21-2001 NA
6 linda joins soccer 12-30-2001 NA
7 linda joins basketball 1-19-2002 NA