I have the following list:
MoreInfos <- list("\n \n London\n \n \n \n Service Green\n \n \n \n Posted: 02 Feb 2022\n \n \n \n ",
"\n \n London\n \n \n \n Service Green\n \n \n \n Posted: 21 Oct 2021\n \n \n \n ",
"\n \n London\n \n \n \n Service Green\n \n \n \n Posted: 18 Mar 2021\n \n \n \n ",
"\n \n London\n \n \n \n Service Green\n \n \n \n Posted: 14 Nov 2021\n \n \n \n ",
"\n \n San Francisco, Singapore, London\n \n \n \n Services & Solutions\n \n \n \n Posted: 30 Jan 2020\n \n \n \n ",
"\n \n San Francisco, Singapore, London\n \n \n \n Solutions\n \n \n \n Posted: 08 Jan 2002\n \n \n \n ")
I want to get rid of all the "\n" sequences and the surrounding whitespace in the list. I also need to extract the three parts of each string (City, Service, Date) into separate columns of a new data frame and format the date.
The output should look like this:
> df
                              City              Service       Date
1                           London        Service Green 02.02.2022
2                           London        Service Green 21.10.2021
3                           London        Service Green 18.03.2021
4                           London        Service Green 14.11.2021
5 San Francisco, Singapore, London Services & Solutions 30.01.2020
6 San Francisco, Singapore, London            Solutions 08.01.2002
For now I tried str_replace_all and gsub, but to me it seems very complicated.
MoreInfos <- str_replace_all(MoreInfos, c("\n"),"|" )
MoreInfos <- gsub("(\\S)\\s{2,}", "\\1", MoreInfos, perl=TRUE)
MoreInfos <- str_replace_all(MoreInfos, c("\\|\\|\\|\\|"),"|" )
I'm sure there is a simple solution to this.
CodePudding user response:
Here's an approach that can probably be simplified further. I split the strings by the newline markers and extracted the 3rd, 7th, and 11th elements, which correspond to the three variables you want to extract:
library(stringr)
library(dplyr)
library(lubridate)  # provides dmy()
MoreInfos %>%
  str_split(pattern = "\\n", simplify = TRUE) %>%
  as_tibble() %>%
  select(3, 7, 11) %>%
  mutate(
    across(where(is.character), trimws),
    V11 = dmy(str_remove(V11, "Posted: "))
  ) %>%
  rename(City = V3, Service = V7, Date = V11)
Output:
# A tibble: 6 × 3
  City                             Service              Date
  <chr>                            <chr>                <date>
1 London                           Service Green        2022-02-02
2 London                           Service Green        2021-10-21
3 London                           Service Green        2021-03-18
4 London                           Service Green        2021-11-14
5 San Francisco, Singapore, London Services & Solutions 2020-01-30
6 San Francisco, Singapore, London Solutions            2002-01-08
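The question asks for dates like 02.02.2022, while the result above holds Date values, which print in ISO format. Here is a minimal sketch of the conversion with base R's format(). The `dates` vector is a hypothetical stand-in for the Date column, not part of the original code:

```r
# Hypothetical stand-in for the Date column produced above
dates <- as.Date(c("2022-02-02", "2021-10-21", "2002-01-08"))

# format() returns character strings, so the result is no longer of class Date;
# it is best applied as a last step, for display only
formatted <- format(dates, "%d.%m.%Y")
print(formatted)
```

Inside the pipeline, the same call could go in mutate(), for example Date = format(Date, "%d.%m.%Y"), at the cost of turning the column into character.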
CodePudding user response:
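# Trim each string, drop the "Posted:" label, and turn every run of 2+ spaces/newlines into ":"
a <- gsub('[ \n]{2,}', ':', sub('Posted:', '', trimws(unlist(MoreInfos))))
# Read the ":"-separated text as a table, then convert the Date column to class Date
read.table(text = a, col.names = c('City', 'Service', 'Date'), sep = ':') |>
  transform(Date = as.Date(Date, "%d %b %Y"))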
                              City              Service       Date
1                           London        Service Green 2022-02-02
2                           London        Service Green 2021-10-21
3                           London        Service Green 2021-03-18
4                           London        Service Green 2021-11-14
5 San Francisco, Singapore, London Services & Solutions 2020-01-30
6 San Francisco, Singapore, London            Solutions 2002-01-08
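To see why the one-liner works, it may help to run the three string steps one at a time on a single element. This is only an illustrative sketch; the names `x`, `step1`, `step2`, and `step3` are made up for this walkthrough:

```r
# One element of the list from the question
x <- "\n \n London\n \n \n \n Service Green\n \n \n \n Posted: 02 Feb 2022\n \n \n \n "

# 1. Strip leading and trailing whitespace (including newlines)
step1 <- trimws(x)

# 2. Remove the literal "Posted:" label
step2 <- sub("Posted:", "", step1, fixed = TRUE)

# 3. Replace every run of two or more spaces/newlines with ":".
#    Single spaces inside "Service Green" or "02 Feb 2022" are left alone.
step3 <- gsub("[ \n]{2,}", ":", step2)
print(step3)
```

The result is one ":"-separated record per string, which read.table() can split into columns. Note that "%b" in as.Date() matches month abbreviations in the current locale, so the date conversion assumes English month names are recognised.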