Home > Back-end >  R extremely long to wide with linked rows
R extremely long to wide with linked rows

Time:11-23

I have a df similar to the following:

df <- data.frame(name = c("billy","billy","sarah","sarah","sarah","sarah","sarah","sarah","sarah","linda","linda","linda","linda","linda","linda"),
                 data = c("soccer","8-01-1992","basketball","soccer","10-19-1995","10-25-1995","basketball","11-24-1995",6,"hockey","soccer","basketball","12-21-2001","12-30-2001","1-19-2002"),
                 event = c("joins.project","joins.when","joins.project","joins.project1","joins.when","joins.when1","participation.project","participation.when","participation.repetitions","joins.project","joins.project1","joins.project2","joins.when","joins.when1","joins.when2"))

> df
    name       data                     event
1  billy     soccer             joins.project
2  billy  8-01-1992                joins.when
3  sarah basketball             joins.project
4  sarah     soccer            joins.project1
5  sarah 10-19-1995                joins.when
6  sarah 10-25-1995               joins.when1
7  sarah basketball     participation.project
8  sarah 11-24-1995        participation.when
9  sarah          6 participation.repetitions
10 linda     hockey             joins.project
11 linda     soccer            joins.project1
12 linda basketball            joins.project2
13 linda 12-21-2001                joins.when
14 linda 12-30-2001               joins.when1
15 linda  1-19-2002               joins.when2

My desired output is below.

new_df <- data.frame(name = c("billy", "sarah", "sarah", "sarah", "linda", "linda", "linda"),
                     join_or_particip = c("join", "join", "join", "participate", "join", "join", "join"),
                     sport = c("soccer", "basketball", "soccer", "basketball", "hockey", "soccer", "basketball"),
                     when = c("8-01-1992", "10-19-1995", "10-25-1995", "11-24-1995", "12-21-2001", "12-30-2001", "1-19-2002"),
                     repetitions = c(NA, NA, NA, 6, NA, NA, NA))
new_df
   name join_or_particip      sport       when repetitions
1 billy             join     soccer  8-01-1992          NA
2 sarah             join basketball 10-19-1995          NA
3 sarah             join     soccer 10-25-1995          NA
4 sarah      participate basketball 11-24-1995           6
5 linda             join     hockey 12-21-2001          NA
6 linda             join     soccer 12-30-2001          NA
7 linda             join basketball  1-19-2002          NA

I imagine I need to use spread() to achieve this but I'm not sure how to account for the fact that certain rows in df are linked (i.e., "joins.project1" and "joins.when1" go together). A bit of further context: this is a dataset of people who "joined" or "participated" in between 1 and 200 "sports" on different dates. If they "participated" there is an additional variable called "repetitions" (i.e., how many times they participated).

CodePudding user response:

The person who designed the structure of that data has no business working with data.

library(data.table)
setDT(df)
df[, c("action", "type") := tstrsplit(event, "\\.")]
df[, c("type", "ID") := tstrsplit(type, "(?<=[A-Za-z])(?=[0-9])", perl = TRUE)]
df[is.na(ID), ID := 0]

res <- df[type == "project"]

#data.table joins:
res[df[type == "when"], when := i.data, on = .(name, action, ID)]
res[df[type == "repetitions"], repetitions := i.data, on = .(name, action, ID)]
res
#      name       data                 event        action    type ID       when repetitions
#  1: billy     soccer         joins.project         joins project  0  8-01-1992        <NA>
#  2: sarah basketball         joins.project         joins project  0 10-19-1995        <NA>
#  3: sarah     soccer        joins.project1         joins project  1 10-25-1995        <NA>
#  4: sarah basketball participation.project participation project  0 11-24-1995           6
#  5: linda     hockey         joins.project         joins project  0 12-21-2001        <NA>
#  6: linda     soccer        joins.project1         joins project  1 12-30-2001        <NA>
#  7: linda basketball        joins.project2         joins project  2  1-19-2002        <NA>

CodePudding user response:

dplyr way

library(dplyr)
library(tidyr)
library(stringr)

df %>%
  separate(event, into = c("join_or_particip", "type"), sep = "\\.") %>%
  mutate(num = as.numeric(gsub('[^0-9.-]', '', type)) %>% replace_na(., 0),
         type = gsub('[0-9] ', '', type)) %>%
  pivot_wider(id_cols = c(name, join_or_particip, num), names_from = type, values_from = data) %>%
  select(-num)


  name  join_or_particip project    when       repetitions
  <chr> <chr>            <chr>      <chr>      <chr>      
1 billy joins            soccer     8-01-1992  NA         
2 sarah joins            basketball 10-19-1995 NA         
3 sarah joins            soccer     10-25-1995 NA         
4 sarah participation    basketball 11-24-1995 6          
5 linda joins            hockey     12-21-2001 NA         
6 linda joins            soccer     12-30-2001 NA         
7 linda joins            basketball 1-19-2002  NA 
  • Related