My goal is to use a regular expression in order to discard the header and footer information from a Project Gutenberg UTF-8 encoded text file.
Each book contains a 'start line' like so:
[...]
Character set encoding: UTF-8
Produced by: Emma Dudding, John Bickers, Dagny and David Widger
*** START OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***
Grimms’ Fairy Tales
By Jacob Grimm and Wilhelm Grimm
[...]
The footers look pretty similar:
Taylor, who made the first English translation in 1823, selecting about
fifty stories ‘with the amusement of some young friends principally in
view.’ They have been an essential ingredient of children’s reading ever
since.
*** END OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***
Updated editions will replace the previous one--the old editions will
be renamed.
My idea is to use these triple asterisk markers to discard headers and footers, since such an operation is useful for any Gutenberg release.
What is a good way to do this with regex?
CodePudding user response:
To match whitespace you can use a literal space ' ' or '\s' (note: \s matches any whitespace character, including \n, \r, etc.).
To match a literal *, you have to escape it as \*, since an unescaped * in regex means zero or more repetitions.
To match three asterisks in a row, you can write the escaped asterisk three times (\*\*\*) or use a quantifier: \*{3}
So your regex could look like: \*{3}
This matches every run of three consecutive *.
To match everything between two runs of three *, as in the header and footer marker lines, you can modify the regex to:
^\*{3}[\w\W]*?\*{3}$
This means:
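^ - beginning of the line (with the multiline flag; otherwise the beginning of the whole string)
\*{3} - match three *
[\w\W]*? - lazily match any characters (word and non-word, so newlines too), as few as possible
\*{3} - match three *
$ - end of the line (with the multiline flag; otherwise the end of the whole string)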
Test here: https://regex101.com/r/d8dcHf/1
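As a rough check outside regex101, here is a minimal Python sketch of the same pattern. The re.MULTILINE flag and the sample text are my assumptions (the answer does not say which flags or language are used); the names MARKER and sample are only for illustration.

```python
import re

# Pattern from the answer above. re.MULTILINE (an assumption) makes
# ^ and $ match at line boundaries instead of only at the string ends.
MARKER = re.compile(r"^\*{3}[\w\W]*?\*{3}$", re.MULTILINE)

# Made-up sample shaped like a Gutenberg file.
sample = (
    "Produced by: Someone\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Body line one.\n"
    "Body line two.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Updated editions will replace the previous one.\n"
)

# The lazy [\w\W]*? stops at the first "***" that ends a line,
# so each marker line is matched on its own.
markers = MARKER.findall(sample)
for line in markers:
    print(line)
```

Because [\w\W] also matches newlines, a line that starts with *** but does not end with *** would make the match run on into the following lines until some line does end with ***, which is worth keeping in mind.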
PS: I think this regex can be optimized or maybe a better one can be created.
CodePudding user response:
I found that this pattern suited my purposes for finding both the header and footer markers, and I don't anticipate any collisions in the body of the novels etc:
Solution
/^\*\*\*.*\*\*\*$/m
Explanation
The ^ matches the start of the line; the \* is necessary to escape the asterisk, which normally has a special purpose; to keep it simple, .* matches anything in the middle; and the $ matches the end of the line. The m is for multiline mode, since Gutenberg works contain regularly spaced \n newlines.
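For anyone who wants to go from finding the markers to actually discarding the header and footer, here is a minimal Python sketch. It assumes the /m flag maps to re.MULTILINE and that the first and last marker lines are the START and END markers; strip_gutenberg, MARKER_LINE and the sample text are hypothetical names and data of my own.

```python
import re

# Same pattern as above; re.MULTILINE plays the role of the /m flag.
MARKER_LINE = re.compile(r"^\*\*\*.*\*\*\*$", re.MULTILINE)


def strip_gutenberg(text: str) -> str:
    """Return the text between the first and last marker lines.

    If fewer than two marker lines are found, the text is returned unchanged.
    """
    matches = list(MARKER_LINE.finditer(text))
    if len(matches) < 2:
        return text
    # Keep everything after the first marker and before the last one.
    return text[matches[0].end():matches[-1].start()].strip()


# Made-up sample shaped like a Gutenberg file.
sample = (
    "Produced by: Someone\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Body line one.\n"
    "Body line two.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Updated editions will replace the previous one.\n"
)

body = strip_gutenberg(sample)
print(body)
```

One thing to watch: in Python, $ in multiline mode only matches just before \n, so a string that still contains \r\n line endings would not match. Reading the file in text mode, e.g. open(path, encoding="utf-8"), normally translates those to \n first.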
Caveat
I can imagine that if any text had a line such as:
Gadzooks!
******
I woke up in a daze...
we would end up with an accidental match. Room for improvement, but for now that may be premature optimization.
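If the accidental match ever becomes a real problem, one possible tightening is to also require the START OF / END OF wording. This is only a sketch in Python: the wording is assumed from the two marker lines quoted in the question, and other releases may phrase them differently. STRICT_MARKER and LOOSE_MARKER are names I made up for the comparison.

```python
import re

# Original pattern: any line that starts and ends with three asterisks.
LOOSE_MARKER = re.compile(r"^\*\*\*.*\*\*\*$", re.MULTILINE)

# Stricter variant (assumption: markers always read "*** START OF ..." or
# "*** END OF ...", as in the two examples from the question).
STRICT_MARKER = re.compile(r"^\*\*\* (?:START|END) OF .*\*\*\*$", re.MULTILINE)

# The problematic passage from the caveat.
sample = "Gadzooks!\n******\nI woke up in a daze...\n"

loose_hits = LOOSE_MARKER.findall(sample)    # the row of six asterisks matches
strict_hits = STRICT_MARKER.findall(sample)  # nothing matches

marker = "*** END OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***"
strict_on_marker = STRICT_MARKER.findall(marker)  # the real marker still matches

print(loose_hits, strict_hits, strict_on_marker)
```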