I have a large text file with sentences in rows like this:
Alpha beta.
Gamma, delta, epsilon!
Eta: theta?
I want to convert this to another text file where every word and punctuation mark is on a separate row, with an empty line between the original sentences, like this:
Alpha
beta
.

Gamma
,
delta
,
epsilon
!

Eta
:
theta
?
I have been experimenting with the following:
cat original.txt | xargs -n1 > new.txt
but this doesn't separate the punctuation from the preceding words, and it leaves no empty lines between sentences:
Alpha
beta.
Gamma,
delta,
epsilon!
Eta:
theta?
What is the solution here? (It should be a Linux command-line solution, as the original.txt file is quite large.)
CodePudding user response:
To get you started, try this:
grep -Eo '[[:punct:]]|[[:alnum:]]+' original.txt
The -E option selects a slightly more modern regular expression dialect than the legacy BRE (grep is one of the oldest regular expression tools, dating back to the early 1970s, and without options it tries to be backwards compatible, though not quite that far back).
The -o option says to print each match on a separate line, and the regex selects a match which is either a single punctuation character or a sequence of alphanumeric characters.
(I'm thinking you want !? as separate punctuation. You'd have to special-case the ellipsis if you want !? as separate but ... as a single match: grep -Eo '\.\.\.|[[:punct:]]|[[:alnum:]]+'.)
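As a quick check of the ellipsis variant, here is a minimal sketch. The temporary sample file and the tokens variable are made up for illustration, and it assumes a grep that supports -o and POSIX character classes (GNU grep, for instance).

```shell
# Hypothetical sample input: one line from the question, one with "..." and "!?".
sample=$(mktemp)
printf '%s\n' 'Alpha beta.' 'Wait... what!?' > "$sample"

# Alternatives: a literal three-dot ellipsis, any single punctuation
# character, or a run of alphanumerics. -o prints each match on its own line.
tokens=$(grep -Eo '\.\.\.|[[:punct:]]|[[:alnum:]]+' "$sample")
printf '%s\n' "$tokens"

rm -f "$sample"
```

On this sample the ellipsis should come out as a single token, while ! and ? land on separate lines.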
To get the empty line between sentences, too, perhaps switch to sed or Awk.
awk '{ gsub(/[[:space:]]+/, "\n"); gsub(/[^[:alnum:]\n]/, "\n&"); }
1; { print "" }' original.txt
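The gsub function performs a global regular expression substitution. We replace every sequence of whitespace with a newline, then add a newline in front of every punctuation character (more precisely, every character that is neither alphanumeric nor a newline). Finally, after printing the sentence (the lone 1 is an always-true condition that triggers the default print action), we print an empty line.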
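One limitation of adding the newline only in front of each punctuation character is that punctuation preceding a word, such as an opening quote, stays attached to it. A possible variation, sketched here on a made-up sample, pads every punctuation character with spaces and lets Awk re-split the line into fields. It assumes an Awk with POSIX character class support (gawk has it); the sample file and the result variable are only for illustration.

```shell
# Hypothetical sample input with a quoted word.
sample=$(mktemp)
printf '%s\n' 'Alpha beta.' '"Gamma," delta!' > "$sample"

# OFS is set to a newline so that rebuilding the record puts one field per line.
result=$(awk -v OFS='\n' '{
    gsub(/[[:punct:]]/, " & ")   # surround each punctuation character with spaces
    $1 = $1                      # force the record to be rebuilt with OFS between fields
    print                        # one token per line
    print ""                     # empty line after each sentence
}' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Because fields are split on runs of whitespace, leading, trailing, and repeated spaces in the input do not produce stray empty lines within a sentence.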
For more advanced preprocessing tasks, you might want to handle in-word punctuation such as the apostrophe or the hyphen in n'est-ce pas, and perhaps nested quotations, etc. At that point you probably want to find an existing tool rather than painstakingly build your own from first principles; regular expressions can only take you so far.
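For the simple in-word case, one possible refinement is to let a word continue across an apostrophe or hyphen when alphanumerics follow on both sides. This is only a sketch under that assumption, not a general tokenizer; the sample lines and the words variable are invented for illustration.

```shell
# Hypothetical sample input with in-word apostrophes and hyphens.
sample=$(mktemp)
printf '%s\n' "n'est-ce pas?" "don't stop - now" > "$sample"

# A word is a run of alphanumerics, optionally continued by an apostrophe
# or hyphen followed by more alphanumerics; anything else that is
# punctuation is emitted as a single-character token.
words=$(grep -Eo "[[:alnum:]]+(['-][[:alnum:]]+)*|[[:punct:]]" "$sample")
printf '%s\n' "$words"

rm -f "$sample"
```

With this pattern n'est-ce and don't stay whole, while the free-standing hyphen and the question mark are still split off.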
CodePudding user response:
With GNU sed:
$ sed -E 's/\s+/\n/g;s/[[:punct:]]/\n&/g;s/$/\n/' original.txt
Alpha
beta
.

Gamma
,
delta
,
epsilon
!

Eta
:
theta
?