I am dealing with messy data. One point where I'm stuck is that some values in a column are not data at all but act as a kind of "header" that splits the table into sections.
An example:
df <- data.frame(
a = c("header1", "value", "value", "header2", "value", "value"),
b = c(1, 2, 3, 4, 5, 6)
)
What I roughly want:
df_goal <- data.frame(
a = c("header1", "value", "value", "header2", "value", "value"),
b = c(1, 2, 3, 4, 5, 6),
c = c("header1", "header1", "header1", "header2", "header2", "header2")
)
So it's basically about manipulating data based on the position of the "headers".
Edit
Answers so far revolve around the "headers" following some pattern, e.g. "header1", etc. In this case, they work great.
I would, however, like to come up with a more general solution where the "headers" are arbitrary, e.g. "fererfw" and "ewetwet" instead of "header1" and "header2" in the above example.
CodePudding user response:
Does this work?
library(dplyr)
library(stringr)
df %>% mutate(x = cumsum(str_detect(a,'header'))) %>% group_by(x) %>%
mutate(c = str_c('header',x)) %>% ungroup() %>% select(-x)
# A tibble: 6 x 3
a b c
<chr> <dbl> <chr>
1 header1 1 header1
2 value 2 header1
3 value 3 header1
4 header2 4 header2
5 value 5 header2
6 value 6 header2
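The grouping step in the pipeline above relies on a running count of header rows. A minimal base R sketch of just that step, assuming the same toy column as in the question (the names `is_header` and `block_id` are only illustrative):

```r
# Same toy column as in the question
a <- c("header1", "value", "value", "header2", "value", "value")

# TRUE on rows that look like a header, FALSE elsewhere
is_header <- grepl("header", a)

# Running count of headers seen so far; each row gets the id of its block
block_id <- cumsum(is_header)
block_id
# 1 1 1 2 2 2
```

Every row between two headers shares the id of the header above it, which is what makes the later `group_by()` work.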
CodePudding user response:
You could use zoo::na.locf:
df$c <- zoo::na.locf(ifelse(grepl('(header\\d+)', df$a), df$a, NA))
Output:
> df
a b c
1 header1 1 header1
2 value 2 header1
3 value 3 header1
4 header2 4 header2
5 value 5 header2
6 value 6 header2
Edit:
With your edit about arbitrary headers, try:
# matches any value that contains word characters followed by digit(s)
df$c <- zoo::na.locf(ifelse(grepl('(\\w+\\d+)', df$a), df$a, NA))
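Before relying on a pattern for arbitrary headers, it may help to check what it actually matches. A quick sketch, assuming the pattern is `\w+\d+` and using the example labels from the question's edit (`candidates` and `matches` are illustrative names):

```r
# Labels taken from the question: one numbered header, one plain value,
# and the two arbitrary headers mentioned in the edit
candidates <- c("header1", "value", "fererfw", "ewetwet")

# Word characters followed by at least one digit
matches <- grepl("\\w+\\d+", candidates)
matches
# TRUE FALSE FALSE FALSE
```

Under this assumption, headers without a digit such as "fererfw" are not picked up, so a pattern only helps when the headers really share one.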
CodePudding user response:
If the headers are arbitrary without any pattern in them, then you can create a vector of known headers, replace the remaining values with NA and use tidyr::fill.
library(dplyr)
library(tidyr)
headers <- c('header1', 'header2')
df %>%
mutate(c = replace(a, !a %in% headers, NA)) %>%
fill(c)
# a b c
#1 header1 1 header1
#2 value 2 header1
#3 value 3 header1
#4 header2 4 header2
#5 value 5 header2
#6 value 6 header2
CodePudding user response:
Late to the party but here is a dependency-free solution:
df$c <-
rep(
grep('^header', df$a, value = TRUE),
rle(cumsum(grepl('^header', df$a)))$lengths
)
# a b c
# 1 header1 1 header1
# 2 value 2 header1
# 3 value 3 header1
# 4 header2 4 header2
# 5 value 5 header2
# 6 value 6 header2
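The same dependency-free idea can be sketched for arbitrary headers. This is only a sketch under one assumption: the header labels are known up front and listed in a vector (here called `headers`, as in the tidyr answer). The names `is_header` and `idx` are illustrative.

```r
df <- data.frame(
  a = c("fererfw", "value", "value", "ewetwet", "value", "value"),
  b = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# Assumption: the header labels are known in advance
headers <- c("fererfw", "ewetwet")

# Flag header rows, then number the blocks with a running count
is_header <- df$a %in% headers
idx <- cumsum(is_header)

# Rows that come before the first header belong to no block
idx[idx == 0] <- NA

# Look up each row's header by its block number
df$c <- df$a[is_header][idx]
df$c
# "fererfw" "fererfw" "fererfw" "ewetwet" "ewetwet" "ewetwet"
```

Replacing the `%in%` test with any other logical vector that marks header rows leaves the rest of the sketch unchanged.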