How to use gsub and regex to identify and remove consecutive symbols?

Time:11-22

I have a column with values such as this:

df <- structure(list(col1 = c(" |  |  |  |  |  |  |  |", "|  |  |  |  |  |  |  |  |  |  |  |  |  |             |", 
"|  |  |  |  |  |  |  |  |  |  |  |  |  |  | ", "stop|", "stop| | ", 
"stop | go")), class = "data.frame", row.names = c(NA, -6L))

I want to remove every | that appears in a run, whether the pipes are directly adjacent or separated by spaces (e.g., | | or | | |).

Currently, I'm trying to enumerate every pattern the pipes can take, but they seem fairly random. Is there a way to make sure my regex covers the following cases?

  1. When there is more than one | consecutively
  2. When there is more than one | consecutively, separated by any number of spaces (e.g., | | or | | |)
  3. When | is at the end of the line (e.g., matched by \\|$)

I would, however, keep the pipe between stop | go.

Here's the code that I'm working with right now, but it removes the pipe in stop | go.

df$col1 <- gsub('[\\| ]{2,}|[\\|$]', '', df$col1)

I want to remove all the | symbols except for the one in stop | go.
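For reference, here is a minimal sketch of why the attempt above over-matches, run on a hypothetical three-element vector x rather than the full column. The class [\\| ] matches any mix of pipes and spaces, so the " | " in stop | go already counts as a run of two or more. Inside square brackets the $ is a literal dollar sign rather than an end-of-line anchor, so [\\|$] matches every single pipe wherever it occurs.

```r
# Hypothetical sample values taken from the last three rows of the column
x <- c("stop|", "stop| | ", "stop | go")

# The original attempt: both alternatives are character classes
attempt <- gsub('[\\| ]{2,}|[\\|$]', '', x)

# The first alternative on its own already swallows " | " in "stop | go"
first_only <- gsub('[\\| ]{2,}', '', x)

print(attempt)
print(first_only)
```

The last element comes back as "stopgo" in both cases, which is the behaviour described in the question.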

CodePudding user response:

Maybe this works

trimws(trimws(gsub('(\\|\\s+){2,}', "", df$col1),
 whitespace = "\\s+\\|"), whitespace = "\\|")

-output

[1] ""          ""          ""          "stop"      "stop"      "stop | go"
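To see what each layer contributes, the nested call can be unrolled into three steps. This is only a sketch on a hypothetical shortened vector x, and it assumes the pattern uses \\s+ (one or more whitespace characters); the whitespace argument of trimws needs R 3.6.0 or later.

```r
# Hypothetical shortened inputs covering the four kinds of rows
x <- c(" |  |  |", "stop|", "stop| | ", "stop | go")

# Step 1: drop runs of two or more "pipe followed by whitespace"
step1 <- gsub('(\\|\\s+){2,}', "", x)

# Step 2: strip a leftover "whitespace then pipe" from either end
step2 <- trimws(step1, whitespace = "\\s+\\|")

# Step 3: strip any pipe still sitting at either end
step3 <- trimws(step2, whitespace = "\\|")

print(step1)
print(step2)
print(step3)
```

The single pipe in stop | go survives every step because it is followed by whitespace only once and is never at either end of the string.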

CodePudding user response:

You could do:

gsub('\\|\\s*\\||\\|\\s*$', '', df$col1)
#> [1] "       "                   "                         "
#> [3] "              "            "stop"                     
#> [5] "stop "                     "stop | go"

Wrap the call in trimws if you don't want the spaces this leaves behind, as in akrun's answer:

trimws(gsub('\\|\\s*\\||\\|\\s*$', '', df$col1))
#> [1] ""          ""          ""          "stop"      "stop"     
#> [6] "stop | go"

CodePudding user response:

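Another regex strategy is to remove every | that is not followed by a whitespace character and then a word character, using a negative lookahead (which requires perl = TRUE):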

trimws(gsub("\\|(?!\\s\\w)", "", df$col1, perl = TRUE))

Output:

[1] ""          ""          ""          "stop"      "stop"      "stop | go"
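One thing to be aware of with the lookahead approach: it keeps a pipe only when exactly one whitespace character separates it from the next word. The sketch below compares it with a relaxed variant that uses \\s* instead of \\s; the variant and the extra inputs in x are assumptions for illustration, not part of the answer above.

```r
# Hypothetical inputs: the spacing after the pipe varies
x <- c("stop | go", "stop |go", "stop |  go", "stop| | ")

# As in the answer: keep a pipe only if followed by one whitespace + word char
strict <- trimws(gsub("\\|(?!\\s\\w)", "", x, perl = TRUE))

# Assumed variant: allow zero or more whitespace characters before the word
relaxed <- trimws(gsub("\\|(?!\\s*\\w)", "", x, perl = TRUE))

print(strict)
print(relaxed)
```

If the real data always has the form stop | go, the strict pattern is enough; the relaxed one may be safer when the spacing around a meaningful pipe is inconsistent.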