R cleaning messy data: variable value as header in table

Time:10-21

I am working with messy data. Where I'm stuck is that some rows of a column do not hold data values but act as a kind of "header" that divides the table into sections.

An example:

df <- data.frame(
    a = c("header1", "value", "value", "header2", "value", "value"),
    b = c(1, 2, 3, 4, 5, 6)
    )

What I roughly want:


df_goal <- data.frame(
    a = c("header1", "value", "value", "header2", "value", "value"),
    b = c(1, 2, 3, 4, 5, 6),
    c = c("header1", "header1", "header1", "header2", "header2", "header2")
) 

So it's basically about manipulating data based on the position of the "headers".

Edit

Answers so far revolve around the "headers" following some pattern, e.g. "header1", etc. In this case, they work great.

I would, however, like to come up with a more general solution where the "headers" are arbitrary, e.g. "fererfw" and "ewetwet" instead of "header1" and "header2" in the above example.

CodePudding user response:

Does this work?

library(dplyr)
library(stringr)

df %>% mutate(x = cumsum(str_detect(a,'header'))) %>% group_by(x) %>% 
     mutate(c = str_c('header',x)) %>% ungroup() %>% select(-x)
# A tibble: 6 x 3
  a           b c      
  <chr>   <dbl> <chr>  
1 header1     1 header1
2 value       2 header1
3 value       3 header1
4 header2     4 header2
5 value       5 header2
6 value       6 header2
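
The grouping step in the pipeline above relies on a running count of header rows. Here is a minimal base R sketch of just that step, assuming the same df as in the question and that every header contains the literal text "header" (the names is_header and group_id are only illustrative):

```r
df <- data.frame(
  a = c("header1", "value", "value", "header2", "value", "value"),
  b = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# TRUE where a row is a header, FALSE otherwise
is_header <- grepl("header", df$a)

# The running total increases by one at each header row, so every row
# receives the index of the most recent header at or above it
group_id <- cumsum(is_header)
print(group_id)
```

Rows sharing a group_id belong to the same section, which is what group_by(x) uses in the answer.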

CodePudding user response:

You could use zoo::na.locf:

df$c <- zoo::na.locf(ifelse(grepl('(header\\d+)', df$a), df$a, NA))

Output:

> df
        a b       c
1 header1 1 header1
2   value 2 header1
3   value 3 header1
4 header2 4 header2
5   value 5 header2
6   value 6 header2
> 

Edit:

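With your edit about arbitrary headers, try the following (it still assumes each header contains word characters followed by at least one digit):

df$c <- zoo::na.locf(ifelse(grepl('(\\w+\\d+)', df$a), df$a, NA))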
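
It may help to check what a pattern of the form \\w+\\d+ actually accepts before relying on it. This small base R sketch tests it against one patterned header, one ordinary value, and one of the arbitrary headers from the question's edit (the vector x is made up for illustration):

```r
# A patterned header, a plain value, and an arbitrary header
x <- c("header1", "value", "fererfw")

# TRUE only for strings containing word characters followed by a digit
matches <- grepl("\\w+\\d+", x)
print(matches)
```

Under these assumptions only "header1" matches, so a header without digits, such as "fererfw", would not be picked up by this pattern.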

CodePudding user response:

If the headers are arbitrary, with no pattern in them, you can create a vector of known headers, replace the remaining values with NA, and use tidyr::fill.

library(dplyr)
library(tidyr)

headers <- c('header1', 'header2')

df %>%
  mutate(c = replace(a, !a %in% headers, NA)) %>%
  fill(c)

#        a b       c
#1 header1 1 header1
#2   value 2 header1
#3   value 3 header1
#4 header2 4 header2
#5   value 5 header2
#6   value 6 header2

CodePudding user response:

Late to the party but here is a dependency-free solution:

df$c <- 
  rep(
    grep('^header', df$a, value = TRUE), 
    rle(cumsum(grepl('^header', df$a)))$lengths
  )

#         a b       c
# 1 header1 1 header1
# 2   value 2 header1
# 3   value 3 header1
# 4 header2 4 header2
# 5   value 5 header2
# 6   value 6 header2
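
For the arbitrary headers mentioned in the question's edit, a dependency-free variant could combine a vector of known headers with cumsum. This is only a sketch: it assumes the header strings are known in advance, and the names headers, is_header and idx are illustrative.

```r
df <- data.frame(
  a = c("fererfw", "value", "value", "ewetwet", "value", "value"),
  b = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# Assumption: the header strings are known in advance
headers <- c("fererfw", "ewetwet")

# TRUE where a row is one of the known headers
is_header <- df$a %in% headers

# Index of the most recent header for each row; 0 means no header seen yet
idx <- cumsum(is_header)
idx[idx == 0] <- NA  # rows before the first header get NA instead of being dropped

# Look up the header text for each row by its header index
df$c <- df$a[is_header][idx]
print(df)
```

Unlike the rep/rle version, this sketch does not require the first row to be a header; any rows before the first header simply get NA in column c.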