I receive data from a 3rd party that I can't scrub before hand. The data comes as a dataframe, where it's all in one long string (including the column headers). I need to parse out the information in the string, rearrange it based on the column headers, then put it into a dataframe in the right order. I receive millions of rows in a complex format, but I'll try to condense the issue in the example below:
The data I receive is about different meals, each meal contains one protein (PR
), one fruit (FR
), and 2 vegetables (VG
). My final dataframe needs to have the column headers [PR
, FR
, VG1
, VG2
]. Here's an example of what I receive:
# mealDF is an example of what I receive from the 3rd party
id <- c('meal1', 'meal2', 'meal3')
meals <- c(' FR Banana VG Carrot VG Celery PR Chicken ', ' VG Broccoli PR Tofu VG Celery FR Apple ', ' PR Pork VG Brussels Sprouts FR Orange VG Carrot ')
mealDF <- data.frame(id, meals, stringsAsFactors = FALSE)
mealDF
id meals
1 meal1 FR Banana VG Carrot VG Celery PR Chicken
2 meal2 VG Broccoli PR Tofu VG Celery FR Apple
3 meal3 PR Pork VG Brussels Sprouts FR Orange VG Carrot
My first step is just getting the actual food items, so parsing out the column headers.
# Here is what I do to parse out the column headers
library(stringr)
sepMeals <- as.data.frame(str_split_fixed(mealDF$meals, c(' FR | VG | PR '), 5))[,2:5]
sepMeals
V2 V3 V4 V5
1 Banana Carrot Celery Chicken
2 Broccoli Tofu Celery Apple
3 Pork Brussels Sprouts Orange Carrot
From here, I'm stuck on how to rearrange the dataframe in the correct order. My first thought is to take the string, and remove everything but the column headers (c(' FR | VG | PR ')
). Then I could find each column header's position for each row, and rearrange the columns in sepMeals
for each row based on the above vector. But like I said I'm stuck on how to do this.
Here is the desired result:
id PR FR VG1 VG2
1 meal1 Chicken Banana Carrot Celery
2 meal2 Tofu Apple Broccoli Celery
3 meal3 Pork Orange Brussels Sprouts Carrot
CodePudding user response:
Try this:
quux <- lapply(strsplit(mealDF$meals, "\\s "), Filter, f=nzchar)
quux <- lapply(quux, function(qu) lapply(split(qu, cumsum(nchar(qu) == 2)), function(z) setNames(list(paste(z[-1], collapse = " ")), z[1])))
lapply(quux, function(qu) data.frame(qu))
# [[1]]
# FR VG VG.1 PR
# 1 Banana Carrot Celery Chicken
# [[2]]
# VG PR VG.1 FR
# 1 Broccoli Tofu Celery Apple
# [[3]]
# PR VG FR VG.1
# 1 Pork Brussels Sprouts Orange Carrot
cbind(mealDF[,1,drop=FALSE], dplyr::bind_rows(lapply(quux, function(qu) data.frame(qu))))
# id FR VG VG.1 PR
# 1 meal1 Banana Carrot Celery Chicken
# 2 meal2 Apple Broccoli Celery Tofu
# 3 meal3 Orange Brussels Sprouts Carrot Pork
I'm using dplyr::bind_rows
because it is a lot more tolerant of different column names or columns in different order; data.table::rbindlist(lapply(..), use.names=TRUE, fill=TRUE)
works equally well. The base-R variant is do.call(rbind, lapply(..))
, but that is not as tolerant so requires manual comparison, reordering, etc, to get to work; that is, it works only when everything is perfectly shaped/named.
CodePudding user response:
An option with tidyverse
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
mealDF %>%
mutate(meals = trimws(meals)) %>%
separate_rows(meals, sep = "(?<=[a-z])\\s (?=[A-Z]{2,4})") %>%
extract(meals, into = c("prefix", "value"), "^(\\w )\\s (.*)") %>%
mutate(rn = rowid(id, prefix)) %>%
group_by(id, prefix) %>%
mutate(prefix = case_when(n() > 1 ~ str_c(prefix, rn), TRUE ~ prefix)) %>%
ungroup %>%
select(-rn) %>%
pivot_wider(names_from = prefix, values_from = value)
-output
# A tibble: 3 × 5
id FR VG1 VG2 PR
<chr> <chr> <chr> <chr> <chr>
1 meal1 Banana Carrot Celery Chicken
2 meal2 Apple Broccoli Celery Tofu
3 meal3 Orange Brussels Sprouts Carrot Pork
CodePudding user response:
Here is a tidyverse approach:
- separate rows
- group by id and meals
- create VG1 and VG2 with ifelse
- bring items with two word (e.g. Brussel sprouts) in one line with paste() and collapse()
- make some tweaking before pivot_wide()
library(tidyverse)
mealDF %>%
separate_rows(meals) %>%
group_by(id,meals) %>%
mutate(meals = ifelse(meals == "VG" & max(row_number() > 1), paste0(meals, row_number()), meals)) %>%
group_by(group = cumsum(str_detect(meals, "FR|VG|PR"))) %>%
mutate(names = first(meals)) %>%
mutate(y = paste(meals, collapse = " ")) %>%
slice(1) %>%
filter(names != "") %>%
mutate(y = str_squish(str_remove(y, names))) %>%
ungroup() %>%
select(-c(meals, group)) %>%
pivot_wider(names_from = names, values_from = y)
id FR VG1 VG2 PR
<chr> <chr> <chr> <chr> <chr>
1 meal1 Banana Carrot Celery Chicken
2 meal2 Apple Broccoli Celery Tofu
3 meal3 Orange Brussels Sprouts Carrot Pork