I am trying to clean the dataset so that all data is in its appropriate cell by combining rows since they are oddly separated. There is an obvious nuance to the dataset in that there are some rows that are correctly coded and there are some that are not.
Here is an example of the data:
Rank | Store | City | Address | Transactions | Avg. Value | Dollar Sales Amt. |
---|---|---|---|---|---|---|
40 | 1404 | State College | Hamilton Square Shop Center | |||
230 W Hamilton Ave | 155548 | 52.86 | 8263499 | |||
41 | 2310 | Springfield | 149 Baltimore Pike | 300258 | 27.24 | 8211137 |
42 | 2514 | Erie | Yorktown Centre | |||
2501 West 12th Street | 190305 | 41.17 | 7862624 |
Here is an example of how I want the data:
Rank | Store | City | Address | Transactions | Avg. Value | Dollar Sales Amt. |
---|---|---|---|---|---|---|
40 | 1404 | State College | Hamilton Square Shop Center, 230 W Hamilton Ave | 155548 | 52.86 | 8263499 |
41 | 2310 | Springfield | 149 Baltimore Pike | 300258 | 27.28 | 8211137 |
42 | 2514 | Erie | Yorktown Centre, 2501 West 12th Street | 190305 | 41.17 | 7862624 |
Is there an Excel or R function to fix this, or does anyone know how to write an R functional to correct this?
I read into the CONCATENATE function in excel and realized it was not going to accomplish anything. I figured an R functional would be the only way to fix this.
CodePudding user response:
The concatenate function will work here or in the excel select the columns and by using the merge formula in the formala option you can complete the task.
CodePudding user response:
I recommend checking how the file is being parsed. From the example data you provided, it looks like the address column is being split on ", " and going to the next line. Based on this assumption alone, below is a potential solution using the tidyverse:
library(tidyverse)
original_data <- tibble(Rank = c(40,NA,41,42,NA),
Store = c(1404,NA,2310,2514,NA),
City = c("State College",NA,"Springfield","Erie",NA),
Address = c("Hamilton Square Shop Center",
"230 W Hamilton Ave","149 Baltimore Pike",
"Yorktown Centre","2501 West 12th Street"),
Transactions = c(NA,155548,300258,NA,190305),
`Avg. Value` = c(NA,52.86,27.24,NA,41.17),
`Dollar Sales Amt.` = c(NA,8263499,8211137,NA,7862624))
new_data <- original_data %>%
fill(Rank:City) %>%
group_by_at(vars(Rank:City)) %>%
mutate(Address1 = lag(Address)) %>%
slice(n()) %>%
ungroup() %>%
mutate(Address = if_else(is.na(Address1), Address,
str_c(Address1, Address, sep = ", "))) %>%
select(Rank:`Dollar Sales Amt.`)
new_data