Removing newlines from inside of sentences - regular expressions

Time:10-07

I have text like this:

a b c 
d e
f g.

h i
j 
k l.
m n o
p q.

I need to replace newlines within a sentence with a space. Here, a sentence is defined as a string that starts with a non-whitespace character and ends with a dot (period).

Newlines between sentences should be preserved.

That is, I need this:

a b c d e f g.
h i j k l.
m n o p q.

I can't come up with a regexp that does that job in Python.

I'd also like this for educational purposes: I would like to understand such a regex. For practical purposes I have come up with some easy non-regex methods, but they all require multi-step code, and I would like an elegant solution that does it in one go.

CodePudding user response:

Frankly, I don't know how to do it with a regex, because there are several cases that may need different handling, and it is simpler to chain several replace() calls:

  • there are empty lines "\n\n"
  • some lines end with space " \n"
  • some lines end with dot ".\n"
  • some lines end without dot and without space "\n"
new = text.replace('\n\n', '\n').replace(' \n', ' ').replace('\n', ' ').replace('. ', '.\n')

Full working code

text = '''a b c 
d e
f g.

h i
j 
k l.
m n o
p q.

'''

print(text)

print('--------')

new = text.replace('\n\n', '\n').replace(' \n', ' ').replace('\n', ' ').replace('. ', '.\n')

print(new)
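
To see what each replace() contributes, here is a minimal sketch that applies the same four steps one at a time and prints the intermediate string after each. The names `steps`, `trace` and `result` are just illustrative; the sample text is the one from the question, including its trailing spaces.

```python
text = 'a b c \nd e\nf g.\n\nh i\nj \nk l.\nm n o\np q.\n\n'

# Each pair is (substring to find, replacement), applied in this order.
steps = [
    ('\n\n', '\n'),  # collapse empty lines
    (' \n', ' '),    # line ends with a space: drop only the newline
    ('\n', ' '),     # every remaining newline becomes a space
    ('. ', '.\n'),   # put a newline back after each sentence-ending dot
]

result = text
trace = []  # intermediate strings, one per step
for old, replacement in steps:
    result = result.replace(old, replacement)
    trace.append(result)
    print(repr(result))
```

The order matters: after the third step the whole text is a single line, and only the last step restores the sentence breaks. Note that the last step would also split on a ". " that appears in the middle of an original line.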

CodePudding user response:

The best solution I can come up with is \n|(\.\n). This matches every line feed character, and it captures a line feed that follows a dot (together with the dot) in a group, which you can substitute back in with $1 (written \1 in Python's re.sub).

\n is the line feed character, | is the alternation (or) operator, and () captures part of the match as a group. \. is a literal dot.

See this example https://regex101.com/r/wWsOVm/1
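
In Python, as far as I can tell, substituting \1 with this pattern removes the in-sentence newlines without inserting a space (so "d e\nf g." becomes "d ef g."), and it drops the second newline of an empty line. A possible variation, sketched under the assumption that one regex plus a replacement function counts as "one go": capture the dot in a group, swallow the surrounding whitespace, and let a callback choose the replacement. The helper name `join_or_keep` is made up for this example.

```python
import re

text = 'a b c \nd e\nf g.\n\nh i\nj \nk l.\nm n o\np q.\n\n'

# First alternative: a dot followed by whitespace that contains a newline
# (group 1 holds the dot). Second alternative: any other run of whitespace
# that contains a newline (group 1 is unset).
pattern = re.compile(r'(\.)\s*\n\s*|\s*\n\s*')

def join_or_keep(match):
    # After a dot, keep exactly one newline; inside a sentence, use one space.
    return '.\n' if match.group(1) else ' '

joined = pattern.sub(join_or_keep, text)
print(joined)
```

The callback is needed because a plain replacement string cannot say "a space, unless group 1 matched". Like the original pattern, this treats every dot at the end of a line as the end of a sentence.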
