I have a CSV that appears to be the output of an Excel Pivot Table with names nested as row labels for repeating groups. I would like to clean the data so that the row labels are repeated in a separate column, ideally using dplyr.
The data looks like this:
dd <- data.frame(variables = c("Abington", "Number of Sales","YTD Number of Sales","Median Sale Price","YTD Median Sale Price", "Acton", "Number of Sales","YTD Number of Sales","Median Sale Price","YTD Median Sale Price"), Year1 = c(" ", 16, 50,415000,413500," ",23,60,799900,704000), Year2 = c(" ",8,13,583000,575000," ",9,39,995000,800000))
dd
variables Year1 Year2
Abington
Number of Sales 16 8
YTD Number of Sales 50 13
Median Sale Price 415000 583000
YTD Median Sale Price 413500 575000
Acton
Number of Sales 23 9
YTD Number of Sales 60 39
Median Sale Price 799900 995000
YTD Median Sale Price 704000 800000
And I would like it to look like this:
Town variables Year1 Year2
Abington Number of Sales 16 8
Abington YTD Number of Sales 50 13
Abington Median Sale Price 415000 583000
Abington YTD Median Sale Price 413500 575000
Acton Number of Sales 23 9
Acton YTD Number of Sales 60 39
Acton Median Sale Price 799900 995000
Acton YTD Median Sale Price 704000 800000
Thank you!
CodePudding user response:
We can use tidyverse
(or dplyr
& tidyr
) for this:
library(tidyverse)
dd %>%
mutate(Town = ifelse(Year1 == " " & Year2 == " ", variables, NA)) %>%
fill(Town, .direction = "down") %>%
filter(Town != variables) %>%
relocate(Town)
Resulting in:
Town variables Year1 Year2
1 Abington Number of Sales 16 8
2 Abington YTD Number of Sales 50 13
3 Abington Median Sale Price 415000 583000
4 Abington YTD Median Sale Price 413500 575000
5 Acton Number of Sales 23 9
6 Acton YTD Number of Sales 60 39
7 Acton Median Sale Price 799900 995000
8 Acton YTD Median Sale Price 704000 8e 05
Important to note that the empty values at Year1
and Year2
are actually whitespaces (" ") rather than empty strings or NA.
CodePudding user response:
Here is another approach:
bind_cols(
tibble(Town=rep(filter(dd,is.na(as.numeric(Year1)))$variables, each=4)),
filter(dd,!is.na(as.numeric(Year1)))
)
Output:
Town variables Year1 Year2
<chr> <chr> <chr> <chr>
1 Abington Number of Sales 16 8
2 Abington YTD Number of Sales 50 13
3 Abington Median Sale Price 415000 583000
4 Abington YTD Median Sale Price 413500 575000
5 Acton Number of Sales 23 9
6 Acton YTD Number of Sales 60 39
7 Acton Median Sale Price 799900 995000
8 Acton YTD Median Sale Price 704000 8e 05