R extremely long to wide with linked rows-CodePudding

I have a df similar to the following:

df <- data.frame(name = c("billy","billy","sarah","sarah","sarah","sarah","sarah","sarah","sarah","linda","linda","linda","linda","linda","linda"),
                 data = c("soccer","8-01-1992","basketball","soccer","10-19-1995","10-25-1995","basketball","11-24-1995",6,"hockey","soccer","basketball","12-21-2001","12-30-2001","1-19-2002"),
                 event = c("joins.project","joins.when","joins.project","joins.project1","joins.when","joins.when1","participation.project","participation.when","participation.repetitions","joins.project","joins.project1","joins.project2","joins.when","joins.when1","joins.when2"))

> df
    name       data                     event
1  billy     soccer             joins.project
2  billy  8-01-1992                joins.when
3  sarah basketball             joins.project
4  sarah     soccer            joins.project1
5  sarah 10-19-1995                joins.when
6  sarah 10-25-1995               joins.when1
7  sarah basketball     participation.project
8  sarah 11-24-1995        participation.when
9  sarah          6 participation.repetitions
10 linda     hockey             joins.project
11 linda     soccer            joins.project1
12 linda basketball            joins.project2
13 linda 12-21-2001                joins.when
14 linda 12-30-2001               joins.when1
15 linda  1-19-2002               joins.when2

My desired output is below.

new_df <- data.frame(name = c("billy", "sarah", "sarah", "sarah", "linda", "linda", "linda"),
                     join_or_particip = c("join", "join", "join", "participate", "join", "join", "join"),
                     sport = c("soccer", "basketball", "soccer", "basketball", "hockey", "soccer", "basketball"),
                     when = c("8-01-1992", "10-19-1995", "10-25-1995", "11-24-1995", "12-21-2001", "12-30-2001", "1-19-2002"),
                     repetitions = c(NA, NA, NA, 6, NA, NA, NA))
new_df
   name join_or_particip      sport       when repetitions
1 billy             join     soccer  8-01-1992          NA
2 sarah             join basketball 10-19-1995          NA
3 sarah             join     soccer 10-25-1995          NA
4 sarah      participate basketball 11-24-1995           6
5 linda             join     hockey 12-21-2001          NA
6 linda             join     soccer 12-30-2001          NA
7 linda             join basketball  1-19-2002          NA

I imagine I need to use spread() to achieve this but I'm not sure how to account for the fact that certain rows in df are linked (i.e., "joins.project1" and "joins.when1" go together). A bit of further context: this is a dataset of people who "joined" or "participated" in between 1 and 200 "sports" on different dates. If they "participated" there is an additional variable called "repetitions" (i.e., how many times they participated).

CodePudding user response：

The person who designed the structure of that data has no business working with data.

library(data.table)
setDT(df)
df[, c("action", "type") := tstrsplit(event, "\\.")]
df[, c("type", "ID") := tstrsplit(type, "(?<=[A-Za-z])(?=[0-9])", perl = TRUE)]
df[is.na(ID), ID := 0]

res <- df[type == "project"]

#data.table joins:
res[df[type == "when"], when := i.data, on = .(name, action, ID)]
res[df[type == "repetitions"], repetitions := i.data, on = .(name, action, ID)]
res
#      name       data                 event        action    type ID       when repetitions
#  1: billy     soccer         joins.project         joins project  0  8-01-1992        <NA>
#  2: sarah basketball         joins.project         joins project  0 10-19-1995        <NA>
#  3: sarah     soccer        joins.project1         joins project  1 10-25-1995        <NA>
#  4: sarah basketball participation.project participation project  0 11-24-1995           6
#  5: linda     hockey         joins.project         joins project  0 12-21-2001        <NA>
#  6: linda     soccer        joins.project1         joins project  1 12-30-2001        <NA>
#  7: linda basketball        joins.project2         joins project  2  1-19-2002        <NA>

CodePudding user response：

dplyr way

library(dplyr)
library(tidyr)
library(stringr)

df %>%
  separate(event, into = c("join_or_particip", "type"), sep = "\\.") %>%
  mutate(num = as.numeric(gsub('[^0-9.-]', '', type)) %>% replace_na(., 0),
         type = gsub('[0-9] ', '', type)) %>%
  pivot_wider(id_cols = c(name, join_or_particip, num), names_from = type, values_from = data) %>%
  select(-num)


  name  join_or_particip project    when       repetitions
  <chr> <chr>            <chr>      <chr>      <chr>      
1 billy joins            soccer     8-01-1992  NA         
2 sarah joins            basketball 10-19-1995 NA         
3 sarah joins            soccer     10-25-1995 NA         
4 sarah participation    basketball 11-24-1995 6          
5 linda joins            hockey     12-21-2001 NA         
6 linda joins            soccer     12-30-2001 NA         
7 linda joins            basketball 1-19-2002  NA