I really hope that title makes sense. I've been given a data set pulled from an online source that comes out a bit broken. The file should have four columns and look like this:
ID Number | Address | Builder | Status |
---|---|---|---|
352TRM | 123 Street Rd | I Build Houses | Complete |
486JRT | 4321 Road St | We Build Houses | Complete |
Instead, the ID Number column is missing and the value of the ID Number is listed in a separate row above the rest of the data like this:
Address | Builder | Status |
---|---|---|
352TRM | ||
123 Street Rd | I Build Houses | Complete |
486JRT | ||
4321 Road St | We Build Houses | Complete |
I done research and I think that essentially I need to make a long data set into a wide one. I see that there are lots of functions like spread() and pivot_wider().
For those functions I think it should be something like this:
data %>% pivot_wider(names_from = Address, values_from = Address)
but the issue is that some of the addresses are repeated (apartment complexes) and I get a "Values are not uniquely identified" error.
I'm not sure where to go next. Is this problem not fixable with reshaping if the addresses aren't all unique? Is there another option I should research? Thank you for your advice!
CodePudding user response:
It would be better to know the reason why you are getting data in such format and fix this at source.
In case, if you are not able to do that here is an option.
library(dplyr)
library(tidyr)
df %>%
mutate(row = ceiling(row_number()/2)) %>%
pivot_longer(cols = -row) %>%
filter(value != '') %>%
group_by(row) %>%
mutate(name = replace(name, 1L, 'ID')) %>%
ungroup %>%
pivot_wider() %>%
select(-row)
# ID Address Builder Status
# <chr> <chr> <chr> <chr>
#1 352TRM 123 Street Rd I Build Houses Complete
#2 486JRT 4321 Road St We Build Houses Complete
From your representation of the data it seems one row of actual data is present across two rows, so we can create groups of two rows get the data in long format. Change the column name of first Address
column to ID
and get the data back in wide format.
CodePudding user response:
You're on the right track, but there are three more things you need:
- You need to tell
pivot_wider()
which column to put each value fromAddress
into -- e.g., that we want"352TRM"
to end up inID Number
but"123 Street Rd"
to end up inAddress
. This is the purpose of thenames_from
argument. Since the ID numbers are all in rows where the other columns areNA
, we can create anames_from
column based on that. - Columns that aren't specified in
names_from
orvalues_from
and that you want in the same row post-pivoting should have the same values. We can accomplish this by fillingNA
s with the values from the row below, usingtidyr::fill(.direction = "up")
. - The above two are enough for your example data... but if there are repeated addresses in your actual data, you'll want to assign a unique ID to each set of rows that should be combined in the result.
library(tidyverse)
data %>%
mutate(
Row = (row_number() 1) %/% 2, # unique destination ID
Variable = if_else(is.na(Builder), "ID Number", "Address")
) %>%
fill(Builder, Status, .direction = "up") %>%
pivot_wider(names_from = Variable, values_from = Address) %>%
select(!Row)
#> # A tibble: 2 x 4
#> Builder Status `ID Number` Address
#> <chr> <chr> <chr> <chr>
#> 1 I Build Houses Complete 352TRM 123 Street Rd
#> 2 We Build Houses Complete 486JRT 4321 Road St
Created on 2022-03-06 by the reprex package (v2.0.1)