Extract strings from list and put them to a data frame

Time:03-15

I have the following list:

MoreInfos <- list("\n                                                                                    \n                                                London\n                                            \n                                        \n                                                                                    \n                                                Service Green\n                                            \n                                        \n                                                                                    \n                                                Posted: 02 Feb 2022\n                                            \n                                        \n                                        \n                                                                            ", 
    "\n                                                                                    \n                                                London\n                                            \n                                        \n                                                                                    \n                                                Service Green\n                                            \n                                        \n                                                                                    \n                                                Posted: 21 Oct 2021\n                                            \n                                        \n                                        \n                                                                            ", 
    "\n                                                                                    \n                                                London\n                                            \n                                        \n                                                                                    \n                                                Service Green\n                                            \n                                        \n                                                                                    \n                                                Posted: 18 Mar 2021\n                                            \n                                        \n                                        \n                                                                            ", 
    "\n                                                                                    \n                                                London\n                                            \n                                        \n                                                                                    \n                                                Service Green\n                                            \n                                        \n                                                                                    \n                                                Posted: 14 Nov 2021\n                                            \n                                        \n                                        \n                                                                            ", 
    "\n                                                                                    \n                                                San Francisco, Singapore, London\n                                            \n                                        \n                                                                                    \n                                                Services & Solutions\n                                            \n                                        \n                                                                                    \n                                                Posted: 30 Jan 2020\n                                            \n                                        \n                                        \n                                                                            ", 
    "\n                                                                                    \n                                                San Francisco, Singapore, London\n                                            \n                                        \n                                                                                    \n                                                Solutions\n                                            \n                                        \n                                                                                    \n                                                Posted: 08 Jan 2002\n                                            \n                                        \n                                        \n                                                                            ")

I want to get rid of all the "\n" characters and the surrounding whitespace in the list. I also need to extract the three string sections (City, Service, Date) into separate columns of a new data frame and format the date.

The output should look like this:

> df
    City                                     Service               Date
1    London                                   Service Green         02.02.2022
2    London                                   Service Green         21.10.2021
3    London                                   Service Green         18.03.2021
4    London                                   Service Green         14.11.2021
5    San Francisco, Singapore, London         Services & Solutions  30.01.2020
6    San Francisco, Singapore, London         Solutions             08.01.2002

So far I have tried str_replace_all and gsub, but that approach seems very complicated to me.

MoreInfos <- str_replace_all(MoreInfos, c("\n"),"|" )
MoreInfos <- gsub("(\\S)\\s{2,}", "\\1", MoreInfos, perl=TRUE)
MoreInfos <- str_replace_all(MoreInfos, c("\\|\\|\\|\\|"),"|" )

I'm sure there is a simple solution to this.

CodePudding user response:

Here's an approach that can probably be simplified further. I split the strings by the newline markers and extracted the 3rd, 7th, and 11th elements, which correspond to the three variables you want to extract:

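library(stringr)
library(dplyr)
library(lubridate)  # provides dmy()

MoreInfos %>% 
  # one row per list element, one column per newline-separated piece
  str_split(pattern = "\\n", simplify = TRUE) %>% 
  as_tibble() %>%   # unnamed matrix columns are named V1, V2, ...
  select(3, 7, 11) %>% 
  mutate(
    across(where(is.character), trimws),
    V11 = dmy(str_remove(V11, "Posted: "))
  ) %>% 
  rename(City = V3, Service = V7, Date = V11)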

Output:

# A tibble: 6 × 3
  City                             Service              Date      
  <chr>                            <chr>                <date>    
1 London                           Service Green        2022-02-02
2 London                           Service Green        2021-10-21
3 London                           Service Green        2021-03-18
4 London                           Service Green        2021-11-14
5 San Francisco, Singapore, London Services & Solutions 2020-01-30
6 San Francisco, Singapore, London Solutions            2002-01-08
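
The question asks for dates in dd.mm.yyyy form, while the Date column above prints in ISO yyyy-mm-dd form. A minimal sketch of the conversion with base R's format() follows. The small df here is a hand-built stand-in for two rows of the result, used only for illustration. Note that the formatted column is character, not Date, so it is probably best applied as a last step before display or export.

```r
# Stand-in for two rows of the result (built by hand for illustration).
df <- data.frame(
  City    = c("London", "San Francisco, Singapore, London"),
  Service = c("Service Green", "Solutions"),
  Date    = as.Date(c("2022-02-02", "2002-01-08"))
)

# format() converts the Date column to character strings like "02.02.2022".
df$Date <- format(df$Date, "%d.%m.%Y")
df$Date
```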

CodePudding user response:

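# Trim each string, drop the "Posted:" label, then replace every run of
# two or more spaces/newlines with a single ":" separator.
a <- gsub('[ \n]{2,}', ':', sub('Posted:', '', trimws(unlist(MoreInfos))))

# Read the ":"-separated lines as a table and parse the date column.
read.table(text = a, col.names = c('City', 'Service', 'Date'), sep = ':') |>
  transform(Date = as.Date(Date, "%d %b %Y"))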

                              City              Service       Date
1                           London        Service Green 2022-02-02
2                           London        Service Green 2021-10-21
3                           London        Service Green 2021-03-18
4                           London        Service Green 2021-11-14
5 San Francisco, Singapore, London Services & Solutions 2020-01-30
6 San Francisco, Singapore, London            Solutions 2002-01-08
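
To see what the first step of this answer produces, it may help to run it on a single string. The sketch below uses a shortened, hand-written entry x in the same shape as the elements of MoreInfos. It is an illustration only, not one of the original strings.

```r
# One shortened entry shaped like the elements of MoreInfos (illustrative).
x <- "\n      \n    London\n   \n   \n    Service Green\n   \n    Posted: 02 Feb 2022\n   \n  "

# trimws() strips leading/trailing whitespace, sub() removes "Posted:",
# and gsub() collapses each run of 2+ spaces/newlines into ":".
a <- gsub("[ \n]{2,}", ":", sub("Posted:", "", trimws(x)))
a
```

Single spaces inside a field (as in "Service Green") are left alone because the pattern only matches runs of two or more. Two caveats apply under these assumptions: a field that itself contains ":" would be split into extra columns, and "%b" in as.Date() matches month abbreviations in the current locale, so a non-English locale may yield NA.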