Home > OS >  Nested Row Labels to Column
Nested Row Labels to Column

Time:06-02

I have a CSV that appears to be the output of an Excel Pivot Table with names nested as row labels for repeating groups. I would like to clean the data so that the row labels are repeated in a separate column, ideally using dplyr.

The data looks like this:

dd <- data.frame(variables = c("Abington", "Number of Sales","YTD Number of Sales","Median Sale Price","YTD Median Sale Price", "Acton", "Number of Sales","YTD Number of Sales","Median Sale Price","YTD Median Sale Price"), Year1 = c(" ", 16, 50,415000,413500," ",23,60,799900,704000), Year2 = c(" ",8,13,583000,575000," ",9,39,995000,800000))

dd

variables              Year1   Year2
Abington              
Number of Sales        16      8
YTD Number of Sales    50      13
Median Sale Price      415000  583000
YTD Median Sale Price  413500  575000
Acton              
Number of Sales        23      9
YTD Number of Sales    60      39
Median Sale Price      799900  995000
YTD Median Sale Price  704000  800000

And I would like it to look like this:

Town          variables               Year1  Year2           
Abington      Number of Sales         16     8
Abington      YTD Number of Sales     50     13
Abington      Median Sale Price       415000 583000
Abington      YTD Median Sale Price   413500 575000          
Acton         Number of Sales         23      9
Acton         YTD Number of Sales     60     39
Acton         Median Sale Price       799900 995000
Acton         YTD Median Sale Price   704000 800000

Thank you!

CodePudding user response:

We can use tidyverse (or dplyr & tidyr) for this:

library(tidyverse)

dd %>%
  mutate(Town = ifelse(Year1 == " " & Year2 == " ", variables, NA)) %>%
  fill(Town, .direction = "down") %>%
  filter(Town != variables) %>%
  relocate(Town)

Resulting in:

      Town             variables  Year1  Year2
1 Abington       Number of Sales     16      8
2 Abington   YTD Number of Sales     50     13
3 Abington     Median Sale Price 415000 583000
4 Abington YTD Median Sale Price 413500 575000
5    Acton       Number of Sales     23      9
6    Acton   YTD Number of Sales     60     39
7    Acton     Median Sale Price 799900 995000
8    Acton YTD Median Sale Price 704000  8e 05

Important to note that the empty values at Year1 and Year2 are actually whitespaces (" ") rather than empty strings or NA.

CodePudding user response:

Here is another approach:

bind_cols(
  tibble(Town=rep(filter(dd,is.na(as.numeric(Year1)))$variables, each=4)),
  filter(dd,!is.na(as.numeric(Year1)))
)

Output:

  Town     variables             Year1  Year2 
  <chr>    <chr>                 <chr>  <chr> 
1 Abington Number of Sales       16     8     
2 Abington YTD Number of Sales   50     13    
3 Abington Median Sale Price     415000 583000
4 Abington YTD Median Sale Price 413500 575000
5 Acton    Number of Sales       23     9     
6 Acton    YTD Number of Sales   60     39    
7 Acton    Median Sale Price     799900 995000
8 Acton    YTD Median Sale Price 704000 8e 05 
  • Related