R: converting text with irregular length into dataframe

Simplified example of a text I have after importing with readLines:

text <- c("just", "stuff", "nothing", "interesting", "date", "06.05.2022", 
"number", "1/3892", "adress", "north street 45", "name", "peter miller", 
"just", "stuff", "nothing", "interesting", "date", "06.05.2022", 
"number", "5/7283", "adress", "south street 11, fareaway", "west street 4", 
"name", "john snow", "just", "stuff", "nothing", "interesting", 
"date", "06.05.2022", "number", "7/112563", "adress", "island street 348", 
"planet street 11, tortuga", "calvary road 9", "name", "hogson, michael", 
"jobs, steve", "just", "stuff", "nothing", "interesting", "date", 
"06.05.2022", "number", "2/1575", "adress", "bowland road 2, mexiko", 
"name", "michael myers", "terry jones", "olivia wilde", "just", 
"stuff", "nothing", "interesting", "date", "06.05.2022", "number", 
"1/93375", "adress", "sunset boulevard", "name", "harrison ford")

The same pattern always repeats. I would like to have a data frame like this:

date	number	adress	name
06.05.2022	5/7283	south street 11, fareaway, west street 4	john snow
06.05.2022	7/112563	island street 348, planet street 11, tortuga, calvary road 9	hogson, michael, jobs, steve

There is always exactly one date and one number, but one or more addresses and one or more names. "just stuff nothing interesting" is also always the same and can reliably be used to detect the end of the names.

I guess this could be achieved with loops, but I gave up on trying. Or is there a function which handles such irregularities? (I'm not even sure if "length" is the right word for it; I hope it is clear what I mean.)

CodePudding user response：

In base R, you could rewrite your text as a valid DCF and read it in.

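# Collapse the vector into one string
x <- paste(text, collapse = ' ')
# Drop the filler that separates the records
x <- gsub('just stuff nothing interesting', '', x)
# Turn each keyword into a "field:" tag on its own line
x <- gsub('(name|number|adress)', '\n\\1:', x)
# A blank line before "date:" starts a new record
x <- gsub("date", "\n\ndate:", x)
read.dcf(textConnection(x), all = TRUE)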

        date   number                                                     adress                                   name
1 06.05.2022   1/3892                                            north street 45                           peter miller
2 06.05.2022   5/7283                    south street 11, fareaway west street 4                              john snow
3 06.05.2022 7/112563 island street 348 planet street 11, tortuga calvary road 9            hogson, michael jobs, steve
4 06.05.2022   2/1575                                     bowland road 2, mexiko michael myers terry jones olivia wilde
5 06.05.2022  1/93375                                           sunset boulevard                          harrison ford

Note that you can run cat(x) to see what a valid DCF looks like.
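To make the target format concrete, here is a small hand-written DCF string (not the actual value of x) with two records taken from the question. Each line is a "field: value" pair and a blank line separates records. The variable names dcf, con and res are only for illustration.

```r
# Two records written by hand in DCF layout:
# "field: value" lines, with an empty line between records
dcf <- paste(
  "date: 06.05.2022",
  "number: 1/3892",
  "adress: north street 45",
  "name: peter miller",
  "",
  "date: 06.05.2022",
  "number: 5/7283",
  "adress: south street 11, fareaway west street 4",
  "name: john snow",
  sep = "\n"
)

con <- textConnection(dcf)
res <- read.dcf(con, all = TRUE)  # all = TRUE returns a data frame
close(con)
res
```

Each record becomes one row and each field name becomes a column.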

Using tidyverse:

library(tidyverse)

text %>%
  str_replace("^(number|name|adress|date)", "\n\\1:") %>%
  str_replace("^(\ndate)", "\n\\1") %>%
  str_c(collapse = " ") %>%
  str_remove_all("just stuff nothing interesting") %>%
  textConnection() %>%
  read.dcf(all = TRUE)

CodePudding user response：

Here's a piped version that works given the specification. It's worth pointing out that text parsing is very fragile to subtle differences in the input, so you may require a bit of work to get this functioning correctly with your real data.

Anyway, hopefully the concept is easy enough to follow:

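library(dplyr)

text %>% 
  paste(collapse = ' ') %>%
  # one string per record
  strsplit('just stuff nothing interesting') %>%
  unlist() %>%
  `[`(nzchar(.)) %>%
  # split each record at the keywords; "adress" is not a split point,
  # so number and address stay together in one piece
  strsplit(' date | number | name ') %>%
  unlist() %>%
  `[`(nzchar(.)) %>%
  gsub(' adress', '', .) %>%
  # three pieces per record: fill column-wise, then transpose
  matrix(nrow = 3) %>%
  t() %>%
  as.data.frame() %>%
  setNames(c('Date', 'Address', 'Name')) %>%
  as_tibble()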
#> # A tibble: 5 x 3
#>   Date       Address                                                       Name 
#>   <chr>      <chr>                                                         <chr>
#> 1 06.05.2022 1/3892 north street 45                                        "pet~
#> 2 06.05.2022 5/7283 south street 11, fareaway west street 4                "joh~
#> 3 06.05.2022 7/112563 island street 348 planet street 11, tortuga calvary~ "hog~
#> 4 06.05.2022 2/1575 bowland road 2, mexiko                                 "mic~
#> 5 06.05.2022 1/93375 sunset boulevard                                      "har~

Created on 2022-05-06 by the reprex package (v2.0.1)
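In the output above, the number and the address share one column, and in the DCF output multiple addresses or names are joined by a space. If you want the layout from the question (separate number column, multiple values joined by ", "), a different technique is to keep the vector elements intact and group them instead of pasting everything into one string. This is only a sketch on a shortened sample; it assumes the vector starts with the filler words followed by "date", and that no real value is exactly equal to a keyword or a filler word. The names keys, filler, record, field and result are made up for the example.

```r
# Shortened sample in the same layout as the question's `text` vector
text <- c("just", "stuff", "nothing", "interesting",
          "date", "06.05.2022", "number", "5/7283",
          "adress", "south street 11, fareaway", "west street 4",
          "name", "john snow",
          "just", "stuff", "nothing", "interesting",
          "date", "06.05.2022", "number", "7/112563",
          "adress", "island street 348", "planet street 11, tortuga",
          "calvary road 9",
          "name", "hogson, michael", "jobs, steve")

keys   <- c("date", "number", "adress", "name")
filler <- c("just", "stuff", "nothing", "interesting")

# Drop the filler elements
x <- text[!text %in% filler]

is_key <- x %in% keys
record <- cumsum(x == "date")        # record id, increases at every "date"
field  <- x[is_key][cumsum(is_key)]  # keyword that each element belongs to

# Keep only the values and paste the values of each record/field cell together
vals <- !is_key
wide <- tapply(x[vals],
               list(record[vals], factor(field[vals], levels = keys)),
               paste, collapse = ", ")

result <- as.data.frame(wide, stringsAsFactors = FALSE)
result
```

Because the original elements are never merged before grouping, a comma inside a single address (such as "south street 11, fareaway") does no harm; the ", " separator is only added between elements.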