Home > Blockchain >  Parsing out a long string and reordering into a dataframe
Parsing out a long string and reordering into a dataframe

Time:09-16

I receive data from a 3rd party that I can't scrub before hand. The data comes as a dataframe, where it's all in one long string (including the column headers). I need to parse out the information in the string, rearrange it based on the column headers, then put it into a dataframe in the right order. I receive millions of rows in a complex format, but I'll try to condense the issue in the example below:

The data I receive is about different meals, each meal contains one protein (PR), one fruit (FR), and 2 vegetables (VG). My final dataframe needs to have the column headers [PR, FR, VG1, VG2]. Here's an example of what I receive:

# mealDF is an example of what I receive from the 3rd party
id <- c('meal1', 'meal2', 'meal3')
meals <- c(' FR Banana VG Carrot VG Celery PR Chicken ', ' VG Broccoli PR Tofu VG Celery FR Apple ', ' PR Pork VG Brussels Sprouts FR Orange VG Carrot ')
mealDF <- data.frame(id, meals, stringsAsFactors = FALSE)
mealDF
     id                                             meals
1 meal1         FR Banana VG Carrot VG Celery PR Chicken 
2 meal2            VG Broccoli PR Tofu VG Celery FR Apple 
3 meal3  PR Pork VG Brussels Sprouts FR Orange VG Carrot 

My first step is just getting the actual food items, so parsing out the column headers.

# Here is what I do to parse out the column headers 
library(stringr)
sepMeals <- as.data.frame(str_split_fixed(mealDF$meals, c(' FR | VG | PR '), 5))[,2:5]
sepMeals
       V2               V3     V4       V5
1  Banana           Carrot Celery Chicken 
2 Broccoli             Tofu Celery   Apple 
3    Pork Brussels Sprouts Orange  Carrot 

From here, I'm stuck on how to rearrange the dataframe in the correct order. My first thought is to take the string, and remove everything but the column headers (c(' FR | VG | PR ')). Then I could find each column header's position for each row, and rearrange the columns in sepMeals for each row based on the above vector. But like I said I'm stuck on how to do this.

Here is the desired result:

     id      PR     FR              VG1    VG2
1 meal1 Chicken Banana           Carrot Celery
2 meal2    Tofu  Apple          Broccoli Celery
3 meal3    Pork Orange Brussels Sprouts Carrot

CodePudding user response:

Try this:

quux <- lapply(strsplit(mealDF$meals, "\\s "), Filter, f=nzchar)
quux <- lapply(quux, function(qu) lapply(split(qu, cumsum(nchar(qu) == 2)), function(z) setNames(list(paste(z[-1], collapse = " ")), z[1])))
lapply(quux, function(qu) data.frame(qu))
# [[1]]
#       FR     VG   VG.1      PR
# 1 Banana Carrot Celery Chicken
# [[2]]
#         VG   PR   VG.1    FR
# 1 Broccoli Tofu Celery Apple
# [[3]]
#     PR               VG     FR   VG.1
# 1 Pork Brussels Sprouts Orange Carrot

cbind(mealDF[,1,drop=FALSE], dplyr::bind_rows(lapply(quux, function(qu) data.frame(qu))))
#      id     FR               VG   VG.1      PR
# 1 meal1 Banana           Carrot Celery Chicken
# 2 meal2  Apple         Broccoli Celery    Tofu
# 3 meal3 Orange Brussels Sprouts Carrot    Pork

I'm using dplyr::bind_rows because it is a lot more tolerant of different column names or columns in different order; data.table::rbindlist(lapply(..), use.names=TRUE, fill=TRUE) works equally well. The base-R variant is do.call(rbind, lapply(..)), but that is not as tolerant so requires manual comparison, reordering, etc, to get to work; that is, it works only when everything is perfectly shaped/named.

CodePudding user response:

An option with tidyverse

library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
mealDF %>% 
  mutate(meals = trimws(meals)) %>%
  separate_rows(meals, sep = "(?<=[a-z])\\s (?=[A-Z]{2,4})") %>% 
  extract(meals, into = c("prefix", "value"), "^(\\w )\\s (.*)") %>%  
  mutate(rn = rowid(id, prefix)) %>%
  group_by(id, prefix) %>%
  mutate(prefix = case_when(n() > 1 ~ str_c(prefix, rn), TRUE ~ prefix)) %>% 
  ungroup %>% 
  select(-rn) %>% 
  pivot_wider(names_from = prefix, values_from = value)

-output

# A tibble: 3 × 5
  id    FR     VG1              VG2    PR     
  <chr> <chr>  <chr>            <chr>  <chr>  
1 meal1 Banana Carrot           Celery Chicken
2 meal2 Apple  Broccoli         Celery Tofu   
3 meal3 Orange Brussels Sprouts Carrot Pork   

CodePudding user response:

Here is a tidyverse approach:

  1. separate rows
  2. group by id and meals
  3. create VG1 and VG2 with ifelse
  4. bring items with two word (e.g. Brussel sprouts) in one line with paste() and collapse()
  5. make some tweaking before pivot_wide()
library(tidyverse)

mealDF %>% 
  separate_rows(meals) %>% 
  group_by(id,meals) %>%
  mutate(meals = ifelse(meals == "VG" & max(row_number() > 1), paste0(meals, row_number()), meals)) %>% 
  group_by(group = cumsum(str_detect(meals, "FR|VG|PR"))) %>% 
  mutate(names = first(meals)) %>% 
  mutate(y = paste(meals, collapse = " ")) %>% 
  slice(1) %>% 
  filter(names != "") %>% 
  mutate(y = str_squish(str_remove(y, names))) %>% 
  ungroup() %>% 
  select(-c(meals, group)) %>% 
  pivot_wider(names_from = names, values_from = y)
  id    FR     VG1              VG2    PR     
  <chr> <chr>  <chr>            <chr>  <chr>  
1 meal1 Banana Carrot           Celery Chicken
2 meal2 Apple  Broccoli         Celery Tofu   
3 meal3 Orange Brussels Sprouts Carrot Pork  
  • Related