Replace multiple spaces in string, but leave single spaces be

Time:11-02

I am reading a PDF file using R. I would like to transform the extracted text so that every run of multiple spaces is replaced by some value (for example "_"). I've come across questions where every run of one or more spaces is replaced using "\\s+" (Merge Multiple spaces to single space; remove trailing/leading spaces), but this will not work for me. I have a string that looks something like this:

"[1]This is the first address                                          This is the second one
 [2]This is the third one                                                                     
 [3]This is the fourth one                                             This is the fifth"

When I apply the answers I found, which replace every run of one or more spaces with a single space, I can no longer recognise the separate addresses, because the result looks like this:

gsub("\\s+", " ", str_trim(PDF))

"[1]This is the first address This is the second one
 [2]This is the third one                                                                     
 [3]This is the fourth one This is the fifth"

So what I am looking for is something like this:

"[1]This is the first address_This is the second one
 [2]This is the third one_                                                                     
 [3]This is the fourth one_This is the fifth"

However, if I adapt the code from that example, I get the following:

gsub("\\s+", "_", str_trim(PDF))

"[1]This_is_the_first_address_This_is_the_second_one
 [2]This_is_the_third_one_                                                                     
 [3]This_is_the_fourth_one_This_is_the_fifth"

Would anyone know a workaround for this? Any help will be greatly appreciated.

CodePudding user response:

Whenever I come across string and regex problems I like to refer to the stringr cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf

On the second page you can see a section titled "Quantifiers", which tells us how to solve this:

library(tidyverse)

s <- "This is the first address                                          This is the second one"

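str_replace_all(s, "\\s{2,}", "_")

(I am loading the complete tidyverse instead of just stringr here due to force of habit.) Any run of 2 or more whitespace characters will now be replaced with _. I use str_replace_all() rather than str_replace() because str_replace() only replaces the first match, and your text contains several such runs.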
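One caveat for the multi-line text in the question: \\s also matches newlines, so a line break followed by a leading space counts as a run of two whitespace characters and would be replaced too. Below is a minimal sketch in base R that matches literal spaces only. The variable name pdf_text and the shortened sample string are stand-ins for illustration, not the real PDF output.

```r
# Shortened stand-in for the text in the question: three lines, wide gaps.
pdf_text <- paste(
  "[1]This is the first address          This is the second one",
  " [2]This is the third one             ",
  " [3]This is the fourth one            This is the fifth",
  sep = "\n"
)

# " {2,}" matches runs of two or more literal spaces only, so single
# spaces and newlines are left untouched. gsub() replaces every match.
result <- gsub(" {2,}", "_", pdf_text)

cat(result, "\n")
```

With stringr, str_replace_all(pdf_text, " {2,}", "_") should give the same result. If the gaps may also contain tabs, "[ \\t]{2,}" is one possible alternative pattern.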

Tags: r, pdf