I have an OCR text document where paragraphs have been broken into individual lines. I'd like to make them whole paragraphs on a single line again (as per the original PDF).
How can I use regex, or find and replace, to remove the line breaks between two lines of text and replace them with a space?
E.g.: every line of text is on its own line. I'd like them to be whole paragraphs, each on a single line.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam vehicula tellus faucibus metus consequat
scelerisque. Maecenas sit amet urna quis ipsum interdum consequat. Praesent elementum libero nec
velit suscipit placerat accumsan vitae lacus. Aliquam erat volutpat. Etiam egestas lectus sed orci
venenatis, ullamcorper gravida elit pulvinar. Pellentesque imperdiet, augue pulvinar sodales dapibus,
tortor magna rutrum nulla, vel ullamcorper mi purus a diam. Ut id odio sed arcu aliquet lobortis.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Donec quam arcu, egestas feugiat eleifend blandit, vulputate non elit. Nulla a erat vel leo maximus
viverra at ac lorem. Nam non imperdiet lorem. Fusce tempor arcu massa, non commodo ligula lobortis
nec. Aliquam sit amet fringilla sapien, non euismod metus. Donec orci mi, sagittis vitae lobortis eu,
aliquet nec libero. Sed sodales magna lacus, pretium lobortis magna varius nec. Pellentesque quis
ipsum viverra orci lobortis egestas. Aliquam porttitor tincidunt ipsum, egestas placerat ante
consectetur in. Morbi porttitor lacus eu augue tincidunt, at aliquet lorem consectetur.
CodePudding user response:
You might be looking for a programmatic/dynamic approach for every new scan generated, so I'm not sure if this answers your question, but since you have Visual Studio Code in your tags I will answer how to do this in VS Code.
Open the keyboard shortcuts from File > Preferences > Keyboard Shortcuts, and bind editor.action.joinLines to a shortcut of your choice, for example Ctrl+J.
Then go ahead and open the text you are looking to fix in VS Code, select it and press that keybinding. All selected lines will be joined into one line, so select one paragraph at a time if you want to keep the paragraphs separate. I hope I helped!
CodePudding user response:
I use two regular expressions to remove linebreaks from OCR texts. They can be used in the Find and Replace dialog in VS Code with regular-expression mode enabled.
- Remove linebreaks at lines ending with a hyphen (leave the Replace field empty):
(?<=\w)- *\n *
- Replace the remaining linebreaks with a single space, while keeping blank lines:
(?<!\n) *\n *(?!\n)
Note that the " *" (a space followed by an asterisk) in the regular expressions trims whitespace at the end and beginning of the lines.
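For a scripted version, the same two patterns can be applied with Python's re module. This is only a minimal sketch: the function name unwrap_ocr_text and the sample text are made up for illustration, and it assumes paragraphs are separated by blank lines in the OCR output.

```python
import re


def unwrap_ocr_text(text: str) -> str:
    """Join hard-wrapped lines into paragraphs (hypothetical helper name)."""
    # Normalise Windows/old-Mac line endings so the patterns only see "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 1: a hyphen at the end of a line is treated as a split word,
    # so the hyphen, the linebreak and surrounding spaces are removed.
    text = re.sub(r"(?<=\w)- *\n *", "", text)
    # Step 2: every remaining single linebreak becomes one space.
    # The lookarounds skip linebreaks that are part of a blank line.
    text = re.sub(r"(?<!\n) *\n *(?!\n)", " ", text)
    return text


if __name__ == "__main__":
    sample = (
        "Lorem ipsum dolor sit amet, consec-\n"
        "tetur adipiscing elit. Nam vehicula \n"
        "tellus faucibus metus consequat.\n"
        "\n"
        "Donec quam arcu, egestas feugiat\n"
        "eleifend blandit."
    )
    print(unwrap_ocr_text(sample))
```

Two things to be aware of: step 1 also removes hyphens that belong to genuine compound words broken at a line end, and a single trailing linebreak at the very end of the text is turned into a space by step 2.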
There is also a Python tool based on Flair called dehyphen that does the job. In my experience it produces useful results, but it can take considerably longer than replacing linebreaks with regular expressions.