I have text like this:
a b c
d e
f g.
h i
j
k l.
m n o
p q.
I need to replace newlines within a sentence with a space. Here, a sentence is defined as a string that starts with a non-whitespace character and ends with a dot (period).
Newlines between sentences should be preserved.
I.e. I need this:
a b c d e f g.
h i j k l.
m n o p q.
I can't come up with a regexp that does that job in Python.
I'd also like this for educational purposes: I would like to understand such a regex. For practical purposes I have come up with some simple non-regex methods, but they all take multiple steps, and I would like an elegant solution that does it in one go.
CodePudding user response:
Frankly, I don't know how to do this with a regex, because there are several cases that may each need a different treatment, and it is simpler to chain several replace() calls:
- there are empty lines: "\n\n"
- some lines end with a space: " \n"
- some lines end with a dot: ".\n"
- some lines end without a dot and without a space: "\n"
new = text.replace('\n\n', '\n').replace(' \n', ' ').replace('\n', ' ').replace('. ', '.\n')
Full working code:
text = '''a b c
d e
f g.
h i
j
k l.
m n o
p q.
'''
print(text)
print('--------')
new = text.replace('\n\n', '\n').replace(' \n', ' ').replace('\n', ' ').replace('. ', '.\n')
print(new)
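As a rough check of the four cases listed above, the same chain can be wrapped in a function and run on a small sample that contains a trailing space, a line ending with a dot, an empty line, and a plain newline. The function name join_sentence_lines and the sample string are made up for illustration. Note that the empty line is collapsed rather than preserved, and that any ". " inside a line would also be turned into a line break.

```python
def join_sentence_lines(text):
    """Apply the same chain of str.replace() calls, in the same order."""
    return (
        text.replace('\n\n', '\n')   # collapse empty lines
            .replace(' \n', ' ')     # line ends with a space: drop the newline
            .replace('\n', ' ')      # every remaining newline becomes a space
            .replace('. ', '.\n')    # restore a newline after each dot
    )


# Hypothetical sample: trailing space, dot + newline, empty line, plain newline
sample = 'a b \nc.\n\nd\ne.\n'
result = join_sentence_lines(sample)
print(repr(result))  # 'a b c.\nd e.\n'
```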
CodePudding user response:
The best solution I can come up with is \n|(\.\n).
This matches all line feed characters, but also puts all line feed characters after a dot into a group which you can substitute back in with $1.
\n is the line feed character, | is the or operator, and () groups part of your match. \. is the dot character.
See this example https://regex101.com/r/wWsOVm/1
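In Python, this pattern could be used with re.sub roughly as sketched below. Substituting only the group would delete the other newlines instead of turning them into spaces, so this sketch passes a function as the replacement; the helper name keep_or_space is made up for illustration. The second call is a different technique that is not from the answer above: a negative lookbehind that matches only newlines not preceded by a dot.

```python
import re

text = 'a b c\nd e\nf g.\nh i\nj\nk l.\nm n o\np q.\n'


def keep_or_space(match):
    # Group 1 is set only when the match is ".\n"; keep that text as is.
    # For a bare "\n" the group is None, so a space is returned instead.
    return match.group(1) or ' '


# The pattern from the answer, with a callable replacement.
joined = re.sub(r'\n|(\.\n)', keep_or_space, text)

# Alternative (an assumption, not from the answer): negative lookbehind.
# (?<!\.) asserts that the character before the newline is not a dot.
joined_lookbehind = re.sub(r'(?<!\.)\n', ' ', text)

print(joined)
```

Both calls produce the same string for this sample. Unlike a chain of replace() calls, neither one touches a ". " that appears in the middle of a line.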