Here is a simplified example of a text I have after importing it with readLines():
text <- c("just", "stuff", "nothing", "interesting", "date", "06.05.2022",
"number", "1/3892", "adress", "north street 45", "name", "peter miller",
"just", "stuff", "nothing", "interesting", "date", "06.05.2022",
"number", "5/7283", "adress", "south street 11, fareaway", "west street 4",
"name", "john snow", "just", "stuff", "nothing", "interesting",
"date", "06.05.2022", "number", "7/112563", "adress", "island street 348",
"planet street 11, tortuga", "calvary road 9", "name", "hogson, michael",
"jobs, steve", "just", "stuff", "nothing", "interesting", "date",
"06.05.2022", "number", "2/1575", "adress", "bowland road 2, mexiko",
"name", "michael myers", "terry jones", "olivia wilde", "just",
"stuff", "nothing", "interesting", "date", "06.05.2022", "number",
"1/93375", "adress", "sunset boulevard", "name", "harrison ford")
The same pattern always repeats. I would like to have a data frame like this:
date | number | adress | name |
---|---|---|---|
06.05.2022 | 5/7283 | south street 11, fareaway, west street 4 | john snow |
06.05.2022 | 7/112563 | island street 348, planet street 11, tortuga, calvary road 9 | hogson, michael, jobs, steve |
There is always exactly one date and one number, but there can be one or more addresses and one or more names. "just stuff nothing interesting" is also always the same and can reliably be used to detect the end of the names.
I guess this could be achieved with loops, but I gave up on trying. Or is there a function that handles such irregular lengths? (I'm not even sure "length" is the right word for it; I hope it is clear what I mean.)
CodePudding user response:
In base R, you could rewrite your text as a valid DCF (Debian control file) string and read it in with read.dcf():
x <- paste(text, collapse = ' ')
x <- gsub('just stuff nothing interesting', '', x)
x <- gsub('(name|number|adress)', '\n\\1:', x)
x <- gsub("date", "\n\ndate:", x)
read.dcf(textConnection(x), all = TRUE)
date number adress name
1 06.05.2022 1/3892 north street 45 peter miller
2 06.05.2022 5/7283 south street 11, fareaway west street 4 john snow
3 06.05.2022 7/112563 island street 348 planet street 11, tortuga calvary road 9 hogson, michael jobs, steve
4 06.05.2022 2/1575 bowland road 2, mexiko michael myers terry jones olivia wilde
5 06.05.2022 1/93375 sunset boulevard harrison ford
Note that you can run cat(x) to see what a valid DCF looks like.
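For orientation, a DCF string is a series of `key: value` lines, with a blank line between records. The following minimal sketch writes two of the records from the question out by hand and reads them back; the names `dcf_string`, `con` and `records` are illustrative only and not part of the answer above.

```r
# Two records from the question, written by hand in DCF form.
# A blank line (the empty string below) separates the records.
dcf_string <- paste(
  "date: 06.05.2022",
  "number: 1/3892",
  "adress: north street 45",
  "name: peter miller",
  "",
  "date: 06.05.2022",
  "number: 5/7283",
  "adress: south street 11, fareaway west street 4",
  "name: john snow",
  sep = "\n"
)

# read.dcf() reads from a connection; all = TRUE returns a data frame
con <- textConnection(dcf_string)
records <- read.dcf(con, all = TRUE)
close(con)

print(records)
```

The gsub() calls in the answer are only there to turn the imported lines into this shape.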
Using tidyverse:
library(tidyverse)
text %>%
str_replace("^(number|name|adress|date)", "\n\\1:") %>%
str_replace("^(\ndate)", "\n\\1") %>%
str_c(collapse = " ") %>%
str_remove_all("just stuff nothing interesting") %>%
textConnection() %>%
read.dcf(all = TRUE)
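Both versions above join the lines of a multi-line address or name with a space, whereas the table in the question separates them with ", ". One possible variation, sketched here in base R, groups the lines by key first and then pastes each group together. The function name `parse_records` and its arguments are hypothetical. The sketch assumes that every record contains all four keys in the same order, that each key has at least one value line, and that no value line is identical to a key or to one of the filler words.

```r
# Hypothetical helper: a sketch, not a tested general-purpose parser.
parse_records <- function(text,
                          keys = c("date", "number", "adress", "name"),
                          filler = c("just", "stuff", "nothing", "interesting")) {
  body <- text[!text %in% filler]   # drop the filler lines
  is_key <- body %in% keys          # TRUE where a line is a key
  field_id <- cumsum(is_key)        # every key line starts a new field

  # Paste the value lines belonging to each field together with ", "
  values <- vapply(
    split(body[!is_key], field_id[!is_key]),
    paste, character(1), collapse = ", "
  )

  # One row per record, one column per key (assumes a fixed key order)
  out <- as.data.frame(
    matrix(values, ncol = length(keys), byrow = TRUE),
    stringsAsFactors = FALSE
  )
  names(out) <- keys
  out
}

# Two records taken from the question's data
sample_text <- c(
  "just", "stuff", "nothing", "interesting",
  "date", "06.05.2022", "number", "5/7283",
  "adress", "south street 11, fareaway", "west street 4",
  "name", "john snow",
  "just", "stuff", "nothing", "interesting",
  "date", "06.05.2022", "number", "7/112563",
  "adress", "island street 348", "planet street 11, tortuga", "calvary road 9",
  "name", "hogson, michael", "jobs, steve"
)

result <- parse_records(sample_text)
print(result)
```

Because the grouping works on whole lines rather than on a collapsed string, a key word that appears inside a value (for example a street called "name lane") would not be mistaken for a key here.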
CodePudding user response:
Here's a piped version that works given the specification. It's worth pointing out that text parsing is very sensitive to subtle differences in the input, so you may need a bit of work to get this functioning correctly with your real data.
Anyway, hopefully the concept is easy enough to follow:
library(dplyr)
text %>%
paste(collapse = ' ') %>%
strsplit('just stuff nothing interesting') %>%
unlist() %>%
`[`(nzchar(.)) %>%
strsplit(' date | number | name ') %>%
unlist() %>%
`[`(nzchar(.)) %>%
gsub(' adress', '', .) %>%
matrix(nrow = 3) %>%
t() %>%
as.data.frame() %>%
setNames(c('Date', 'Address', 'Name')) %>%
as_tibble()
#> # A tibble: 5 x 3
#> Date Address Name
#> <chr> <chr> <chr>
#> 1 06.05.2022 1/3892 north street 45 "pet~
#> 2 06.05.2022 5/7283 south street 11, fareaway west street 4 "joh~
#> 3 06.05.2022 7/112563 island street 348 planet street 11, tortuga calvary~ "hog~
#> 4 06.05.2022 2/1575 bowland road 2, mexiko "mic~
#> 5 06.05.2022 1/93375 sunset boulevard "har~
Created on 2022-05-06 by the reprex package (v2.0.1)
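In the output above, the number ends up in the Address column, because the split pattern does not include ' adress ' and the key is simply deleted afterwards. If the number should get its own column, as in the table in the question, one possible variant is sketched below as plain base R steps rather than a pipe. The names `sample_text`, `flat`, `records`, `fields` and `result` are illustrative. The sketch assumes each record has exactly the four keys and that no value contains a key word surrounded by spaces.

```r
# Two records taken from the question's data
sample_text <- c(
  "just", "stuff", "nothing", "interesting",
  "date", "06.05.2022", "number", "1/3892",
  "adress", "north street 45", "name", "peter miller",
  "just", "stuff", "nothing", "interesting",
  "date", "06.05.2022", "number", "5/7283",
  "adress", "south street 11, fareaway", "west street 4",
  "name", "john snow"
)

# Collapse to one string and cut it into records at the filler phrase
flat <- paste(sample_text, collapse = " ")
records <- strsplit(flat, "just stuff nothing interesting", fixed = TRUE)[[1]]
records <- records[nzchar(trimws(records))]

# Split every record at all four keys, then drop empty pieces and padding
fields <- unlist(strsplit(records, " date | number | adress | name "))
fields <- trimws(fields)
fields <- fields[nzchar(fields)]

# Four fields per record, filled row by row
result <- as.data.frame(
  matrix(fields, ncol = 4, byrow = TRUE),
  stringsAsFactors = FALSE
)
names(result) <- c("Date", "Number", "Address", "Name")
print(result)
```

The trimws() call also removes the trailing space that makes the names print in quotes in the tibble above.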