Home > Net >  Advice for fixing data set made long by broken column?
Advice for fixing data set made long by broken column?

Time:03-07

I really hope that title makes sense. I've been given a data set pulled from an online source that comes out a bit broken. The file should have four columns and look like this:

ID Number Address Builder Status
352TRM 123 Street Rd I Build Houses Complete
486JRT 4321 Road St We Build Houses Complete

Instead, the ID Number column is missing and the value of the ID Number is listed in a separate row above the rest of the data like this:

Address Builder Status
352TRM
123 Street Rd I Build Houses Complete
486JRT
4321 Road St We Build Houses Complete

I done research and I think that essentially I need to make a long data set into a wide one. I see that there are lots of functions like spread() and pivot_wider().

For those functions I think it should be something like this:

data %>% pivot_wider(names_from = Address, values_from = Address)

but the issue is that some of the addresses are repeated (apartment complexes) and I get a "Values are not uniquely identified" error.

I'm not sure where to go next. Is this problem not fixable with reshaping if the addresses aren't all unique? Is there another option I should research? Thank you for your advice!

CodePudding user response:

It would be better to know the reason why you are getting data in such format and fix this at source.

In case, if you are not able to do that here is an option.

library(dplyr)
library(tidyr)

df %>%
  mutate(row = ceiling(row_number()/2)) %>%
  pivot_longer(cols = -row) %>%
  filter(value != '') %>%
  group_by(row) %>%
  mutate(name = replace(name, 1L, 'ID')) %>%
  ungroup %>%
  pivot_wider() %>%
  select(-row)

#   ID    Address       Builder         Status  
#  <chr>  <chr>         <chr>           <chr>   
#1 352TRM 123 Street Rd I Build Houses  Complete
#2 486JRT 4321 Road St  We Build Houses Complete

From your representation of the data it seems one row of actual data is present across two rows, so we can create groups of two rows get the data in long format. Change the column name of first Address column to ID and get the data back in wide format.

CodePudding user response:

You're on the right track, but there are three more things you need:

  1. You need to tell pivot_wider() which column to put each value from Address into -- e.g., that we want "352TRM" to end up in ID Number but "123 Street Rd" to end up in Address. This is the purpose of the names_from argument. Since the ID numbers are all in rows where the other columns are NA, we can create a names_from column based on that.
  2. Columns that aren't specified in names_from or values_from and that you want in the same row post-pivoting should have the same values. We can accomplish this by filling NAs with the values from the row below, using tidyr::fill(.direction = "up").
  3. The above two are enough for your example data... but if there are repeated addresses in your actual data, you'll want to assign a unique ID to each set of rows that should be combined in the result.
library(tidyverse)

data %>% 
  mutate(
    Row = (row_number()   1) %/% 2,     # unique destination ID
    Variable = if_else(is.na(Builder), "ID Number", "Address")
  ) %>% 
  fill(Builder, Status, .direction = "up") %>% 
  pivot_wider(names_from = Variable, values_from = Address) %>%
  select(!Row)

#> # A tibble: 2 x 4
#>   Builder         Status   `ID Number` Address      
#>   <chr>           <chr>    <chr>       <chr>        
#> 1 I Build Houses  Complete 352TRM      123 Street Rd
#> 2 We Build Houses Complete 486JRT      4321 Road St

Created on 2022-03-06 by the reprex package (v2.0.1)

  • Related