Take messy dataframe with rows interspersed (i.e. a row bookending each chunk of observations) and m

Time:10-26

I have a messy dataframe in which each chunk of observations (rows) is followed by a bookend row that identifies that chunk. This pattern repeats throughout the data.

A sample of the input dataframe I am working with:

val<-c("Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067",
       "10/22/2110:51:4100PROG 0        PD    Pass        0.0150",
       "Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067",
       "10/22/2110:52:0300PROG 0        PD    Pass        0.0086")
type<-c("Settle", "Settle", "Settle",NA,"Settle", "Settle", "Settle",NA)
time_sec<-c(0.90,0.80,0.70,"10/22",0.90,0.80,0.70,"10/22")
sec<-c("s","s","s",NA,"s","s","s",NA)
press_psi<-c(10.0073,10.0070, 10.0067,NA,10.0073,10.0070, 10.0067,NA)

df<-data.frame(val,type,time_sec,press_psi)

I want to take that last row from each chunk of observations and use it as an identifier housed in a new column so it looks like this:

val<-c("Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067", 
       "Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067")
type<-c("Settle", "Settle", "Settle","Settle", "Settle", "Settle")
time_sec<-c(0.90,0.80,0.70,0.90,0.80,0.70)
sec<-c("s","s","s","s","s","s")
press_psi<-c(10.0073,10.0070, 10.0067,10.0073,10.0070, 10.0067)
part<-c("10/22/2110:51:4100PROG 0        PD    Pass        0.0150",
        "10/22/2110:51:4100PROG 0        PD    Pass        0.0150",
        "10/22/2110:51:4100PROG 0        PD    Pass        0.0150",
        "10/22/2110:52:0300PROG 0        PD    Pass        0.0086",
        "10/22/2110:52:0300PROG 0        PD    Pass        0.0086",
        "10/22/2110:52:0300PROG 0        PD    Pass        0.0086")
df<-data.frame(val,type,time_sec,press_psi,part)

Because the number of observations per 'part' section will vary over time, I was hoping for an approach, perhaps using grepl to detect 'PD' in the messy bookend rows, that works regardless of how many observations and/or 'part' sections there are. The following worked, but it would require more work: each separate dataframe in the output list would need its own 'part' identifier column before rbind()-ing them back together.

lapply(split(df, cumsum(grepl("PD", df$val))), tail, -1)
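For reference, a minimal base R sketch of that extra labelling step. It assumes the bookend rows can be detected with grepl("PD", ...) as above, and it shifts the group id by one row so that each bookend stays in the chunk it closes instead of opening the next one. The names is_pd, grp, chunks and result are illustrative only.

```r
# Rebuild the sample 'val' column (other columns would carry along unchanged).
val <- c("Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067",
         "10/22/2110:51:4100PROG 0        PD    Pass        0.0150",
         "Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067",
         "10/22/2110:52:0300PROG 0        PD    Pass        0.0086")
df <- data.frame(val, stringsAsFactors = FALSE)

is_pd <- grepl("PD", df$val)

# Group id = number of bookend rows seen strictly before each row,
# so a bookend shares a group with the observations that precede it.
grp <- cumsum(is_pd) - is_pd

chunks <- lapply(split(df, grp), function(chunk) {
  pd  <- grepl("PD", chunk$val)
  out <- chunk[!pd, , drop = FALSE]
  # Label every observation with its bookend; NA if the chunk has none yet.
  label <- if (any(pd)) chunk$val[pd][1] else NA_character_
  out$part <- rep(label, nrow(out))
  out
})

# Stack the labelled chunks back into one dataframe.
result <- do.call(rbind, chunks)
rownames(result) <- NULL
```

Each element of chunks is one labelled section, so the rbind() step only has to stack them.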

But surely there is another way!

The messy dataframe is due to an unchangeable/non-reprogrammable piece of equipment, hence having to deal with the messiness. As you can see I already had to extract the various pieces of data from the 'val' column, which stores data as a messy string.

Answer:

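You can create a new column (part) that holds the val value for rows containing 'PD' and NA for every other row, then fill the NAs upward with tidyr::fill so that each observation picks up the bookend row that follows it. Finally, use filter to remove the rows with a 'PD' value.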

library(dplyr)
library(tidyr)

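df <- df %>%
  # keep val only on the bookend ('PD') rows; every other row becomes NA
  mutate(part = replace(val, !grepl("PD", val), NA)) %>%
  # fill upward so each observation takes the bookend row that follows it
  fill(part, .direction = 'up') %>%
  # drop the bookend rows now that their value lives in 'part'
  filter(!grepl("PD", val))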

df

#                 val   type time_sec press_psi                                                     part
#1 Settle0.90s10.0073 Settle      0.9   10.0073 10/22/2110:51:4100PROG 0        PD    Pass        0.0150
#2 Settle0.80s10.0070 Settle      0.8   10.0070 10/22/2110:51:4100PROG 0        PD    Pass        0.0150
#3 Settle0.70s10.0067 Settle      0.7   10.0067 10/22/2110:51:4100PROG 0        PD    Pass        0.0150
#4 Settle0.90s10.0073 Settle      0.9   10.0073 10/22/2110:52:0300PROG 0        PD    Pass        0.0086
#5 Settle0.80s10.0070 Settle      0.8   10.0070 10/22/2110:52:0300PROG 0        PD    Pass        0.0086
#6 Settle0.70s10.0067 Settle      0.7   10.0067 10/22/2110:52:0300PROG 0        PD    Pass        0.0086
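
If adding dplyr and tidyr is not an option, the same fill-upward idea can be sketched in base R with an index lookup. This is an illustration under the same assumption that bookend rows contain 'PD'. The helper name label_chunks and its pattern argument are hypothetical, and the input below is the sample data with its first row dropped so that the two chunks have different sizes.

```r
# Hypothetical helper: label each observation with the bookend row after it.
label_chunks <- function(df, pattern = "PD") {
  is_pd   <- grepl(pattern, df$val)
  pd_vals <- df$val[is_pd]
  # Position of the next bookend for each row: count of bookends seen
  # strictly before the row, plus one.
  idx <- cumsum(is_pd) - is_pd + 1
  # Rows after the last bookend index past the end of pd_vals and get NA.
  df$part <- pd_vals[idx]
  out <- df[!is_pd, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Sample values with the first observation removed: chunks of 2 and 3 rows.
val <- c("Settle0.80s10.0070", "Settle0.70s10.0067",
         "10/22/2110:51:4100PROG 0        PD    Pass        0.0150",
         "Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067",
         "10/22/2110:52:0300PROG 0        PD    Pass        0.0086")
messy <- data.frame(val, stringsAsFactors = FALSE)

labelled <- label_chunks(messy)
```

Because the lookup is driven only by where the 'PD' rows fall, the number of observations per section does not need to be known in advance.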