I have a text file with the following pattern:
Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue.
Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing

A, nec, quam eleifend quis, magnis sit pretium. leo augue. amet, elit. vel

Vel, dis eget nascetur justo. imperdiet consequat et sit Nam Aenean a, Quisque
Enim. a, dui. Aenean lorem Phasellus commodo quis, pretium ultricies nascetur
tincidunt. sem. vitae,
montes, tellus. amet, venenatis natoque enim. fringilla
quis, vitae, Aenean Etiam viverra ipsum dapibus ut elementum Aenean Lorem eget,
nisi mollis Curabitur Quisque Aenean rhoncus sociis justo, sem. justo, vel
Aenean ultricies nec, eu laoreet.

Dr. Enim. vitae, feugiat in, Aenean
Abstract title: Massa. sociis dis dapibus dolor semper ipsum
jalor

Semper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet
eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla
ligula vulputate ac, nisi. enim dapibus. Donec metus In sit dolor Nam ultricies
imperdiet. pellentesque Cras eu, massa quis porttitor parturient varius ut,
Phasellus arcu. pretium. quam augue. eu, adipiscing felis, enim. ante,
vulputate Integer dui. ultricies a, dictum rutrum. Nullam nec, quis,
consequat Cum tellus. dis felis dolor. nulla Aliquam Donec massa. justo. in,
nascetur
Semper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet
eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla


Dr. Justo. nisi elementum ante, Donec Aenean Nulla
Abstract title:

Aenean consectetuer leo penatibus eget imperdiet nisi. consequat
lorem pretium mus.

Prof. Dr. Aliquam metus semper
Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum
eleifend
More information will be available soon.
I want to extract these parts:
Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing
Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor
Abstract title:
and
Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon.
Now, I found these are helpful:
- Regex JS: Matching string between two strings including newlines
- Regular Expression to find a string included between two characters while EXCLUDING the delimiters
but (?<=(Abstract title:))(.*)(?=\n{2})
returns only
Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing
and
Abstract title:
Also, I am not sure which tool would be most efficient for this: awk, shell, or R? Please forgive me if this is a noob question; I am open to suggestions.
CodePudding user response:
In R, you can extract your matches and "normalize" all whitespace inside matches to a regular single space using
x <- "Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue.\nAbstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing\n\nA, nec, quam eleifend quis, magnis sit pretium. leo augue. amet, elit. vel\n\nVel, dis eget nascetur justo. imperdiet consequat et sit Nam Aenean a, Quisque\nEnim. a, dui. Aenean lorem Phasellus commodo quis, pretium ultricies nascetur\ntincidunt. sem. vitae,\nmontes, tellus. amet, venenatis natoque enim. fringilla\nquis, vitae, Aenean Etiam viverra ipsum dapibus ut elementum Aenean Lorem eget,\nnisi mollis Curabitur Quisque Aenean rhoncus sociis justo, sem. justo, vel\nAenean ultricies nec, eu laoreet.\n\nDr. Enim. vitae, feugiat in, Aenean\nAbstract title: Massa. sociis dis dapibus dolor semper ipsum\njalor\n\nSemper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet\neleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla\nligula vulputate ac, nisi. enim dapibus. Donec metus In sit dolor Nam ultricies\nimperdiet. pellentesque Cras eu, massa quis porttitor parturient varius ut,\nPhasellus arcu. pretium. quam augue. eu, adipiscing felis, enim. ante,\nvulputate Integer dui. ultricies a, dictum rutrum. Nullam nec, quis,\nconsequat Cum tellus. dis felis dolor. nulla Aliquam Donec massa. justo. in,\nnascetur\nSemper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet\neleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla\n\n\nDr. Justo. nisi elementum ante, Donec Aenean Nulla\nAbstract title:\n\nAenean consectetuer leo penatibus eget imperdiet nisi. consequat\nlorem pretium mus. \n\nProf. Dr. Aliquam metus semper\nAbstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum\neleifend\nMore information will be available soon.\n"
library(stringr)
pattern <- "(?<=Abstract title:).*(?:\n(?!\n).*)*"
results <- lapply(str_extract_all(x, pattern), function(z) trimws(gsub("\\s+", " ", z)))
The results will look like
[[1]]
[1] "Lorem ipsum dolor sit amet, consectetuer adipiscing"
[2] "Massa. sociis dis dapibus dolor semper ipsum jalor"
[3] ""
[4] "Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon."
See the R demo online and the regex demo.
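If you would rather not depend on stringr, the same extraction can probably be done in base R. This is only a minimal sketch: the sample string x below is made up to mimic the layout of the file in the question, and titles is a name chosen for illustration. Note that base R needs perl = TRUE for the lookbehind and lookahead to work.

```r
# Hypothetical sample with the same layout as the file in the question.
x <- "Prof. A\nAbstract title: First line\nsecond line\n\nBody text.\n\nDr. B\nAbstract title:\n\nMore body.\n"

# Same pattern as in the stringr version; lookarounds require perl = TRUE.
pattern <- "(?<=Abstract title:).*(?:\n(?!\n).*)*"

# gregexpr() finds every match, regmatches() pulls out the matched text.
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Collapse each run of whitespace (line breaks included) and trim the ends.
titles <- trimws(gsub("\\s+", " ", matches))
print(titles)
```

An entry with nothing after Abstract title: comes back as an empty string, which mirrors the third result above.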
Regex details:
- (?<=Abstract title:) - a positive lookbehind that matches a position immediately preceded with Abstract title:
- .* - any zero or more chars other than line break chars, as many as possible
- (?:\n(?!\n).*)* - zero or more sequences of
  - \n(?!\n) - a line feed char not immediately followed with another line feed char
  - .* - any zero or more chars other than line break chars, as many as possible
The lapply(..., function(z) trimws(gsub("\\s+", " ", z))) call "shrinks" each run of whitespace in the matches to a single space and trims the ends.
Parsing the text file into two columns
You can use
library(readr)
library(stringr)
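file <- read_lines(path)  # path: location of your text file
file_string <- paste(file, collapse="\n")
# Group 1: the speaker line; group 2: the abstract title up to the next blank line
pattern <- "(?m)^(.+)\n(Abstract title:.*(?:\n(?!\n).*)*)"
res <- str_match_all(file_string, pattern)
# Drop the full-match column and collapse runs of whitespace
res <- lapply(res, function(z) trimws(gsub("\\s+", " ", z[,-1])))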
The output is
[[1]]
[,1] [,2]
[1,] "Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue." "Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing"
[2,] "Dr. Enim. vitae, feugiat in, Aenean" "Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor"
[3,] "Dr. Justo. nisi elementum ante, Donec Aenean Nulla" "Abstract title:"
[4,] "Prof. Dr. Aliquam metus semper" "Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon."
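If a data frame is handier than a character matrix, something along these lines might work in base R. It is a sketch under assumptions: the sample string x is invented, and squish, blocks, speaker and title are illustrative names, not part of any package. Instead of capture groups, it grabs each two-part block whole and then splits it at the first line break.

```r
# Hypothetical sample with the same layout as the file in the question.
x <- "Prof. A\nAbstract title: First line\nsecond line\n\nBody text.\n\nDr. B\nAbstract title:\n\nMore body.\n"

# Speaker line, then the abstract title up to the next blank line.
pattern <- "(?m)^.+\nAbstract title:.*(?:\n(?!\n).*)*"
blocks <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Helper (illustrative name): collapse whitespace runs and trim the ends.
squish <- function(s) trimws(gsub("\\s+", " ", s))

df <- data.frame(
  # Everything before the first line break is the speaker line.
  speaker = squish(sub("(?s)\n.*", "", blocks, perl = TRUE)),
  # Everything after the first line break is the abstract title.
  title = squish(sub("^[^\n]*\n", "", blocks, perl = TRUE)),
  stringsAsFactors = FALSE
)
print(df)
```

From there, write.csv(df, ...) or similar would give you a two-column file.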
CodePudding user response:
Try this regex:
Abstract title:(?:.|\r?\n\w)*
It captures everything like:
Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing
Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor
Abstract title:
Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon.
(as you described in your question).
Tell me if it works for you.
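For anyone who wants to try this pattern from R, here is a rough sketch using base R with perl = TRUE. The sample string x is invented for illustration (including the third entry), and found is just a name picked here. One thing to be aware of: the \w after the line break means a continuation line only counts if it starts with a letter, digit or underscore, so a line starting with something like a parenthesis ends the match early.

```r
# Hypothetical sample; the third entry has a continuation line starting with "(".
x <- "Prof. A\nAbstract title: First line\nsecond line\n\nBody text.\n\nDr. B\nAbstract title:\n\nMore body.\n\nDr. C\nAbstract title: Third\n(continued)\n"

# Any char, or a line break that is directly followed by a word char.
pattern <- "Abstract title:(?:.|\r?\n\\w)*"

found <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Collapse whitespace runs (line breaks included) and trim the ends.
found <- trimws(gsub("\\s+", " ", found))
print(found)
```

With this sample, the third title stops at Third because (continued) does not begin with a word character.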