I'm cleaning some text data and I've run into a problem removing newline text. The data contains not only \n strings but also \n\n strings, as well as numbered newlines such as \n2 and \n\n2. The latter are my problem. How can these be removed using regex?
I'm working in R. Here is some sample text and what I've used, so far:
#string
string <- "There is a square in the apartment. \n\n4Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten.\n2"
#code attempt
gsub("[\r\\n0-9]", '', string)
The problem with this regex is that it removes every digit and also matches the letter n.
I would like to have the following output:
"There is a square in the apartment. Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten."
I'm using regexr for reference.
CodePudding user response:
One way to solve this problem is to use the \n\n pattern in the regular expression, which will match two newline characters together. You can then replace this pattern with an empty string to remove it. Here is an example:
# Use the gsub() function to replace the pattern "\\n\\n" with an empty string
gsub("\\n\\n", "", string)
This regular expression matches only two consecutive newline characters, so it does not remove any other numbers or letters. On the sample string, however, it leaves the 4 in front of Great and the trailing \n2 in place, so on its own it does not give the desired output.
Another way to solve this problem is to use the \n[0-9] pattern in the regular expression, which will match a newline character followed by a digit. You can then replace this pattern with an empty string to remove it. Here is an example:
# Use the gsub() function to replace the pattern "\\n[0-9]" with an empty string
gsub("\\n[0-9]", "", string)
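This regular expression matches only a newline character that is directly followed by a digit, so it does not remove any other numbers or letters. On the sample string, however, it leaves the first newline of \n\n4 in place, because that newline is followed by another newline rather than by a digit.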
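To see what each of the two patterns above leaves behind on the sample string, and how they could be merged into one, here is a minimal sketch. The merged pattern \\n+[0-9]? is an assumption added for illustration and is not part of the answer; the names step1, step2 and cleaned are hypothetical.

```r
string <- "There is a square in the apartment. \n\n4Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten.\n2"

# Pattern 1: only the double newline is removed;
# the "4" before "Great" and the trailing "\n2" stay
step1 <- gsub("\\n\\n", "", string)

# Pattern 2: only a newline directly followed by a digit is removed;
# the first newline of "\n\n4" stays
step2 <- gsub("\\n[0-9]", "", string)

# Merged sketch: one or more newlines, then one optional digit
cleaned <- gsub("\\n+[0-9]?", "", string)

print(step1)
print(step2)
print(cleaned)
```

Digits that follow a space, such as the 4 in "4 laughs", are not touched because the digit is only matched right after a newline.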
CodePudding user response:
To remove newlines together with the digit attached to them, you can use the following regular expression:
gsub("\\n\\n?[0-9]?", '', string)
This removes any \n character, together with an optional second \n and an optional single digit that follow it. The newline is written outside square brackets because, as the question shows, [\\n] also matches the letter n. Note that backslashes must be escaped inside an R string literal, so each backslash in the regex is written as two.
Here's an example of using this regex in R:
#string
string <- "There is a square in the apartment. \n\n4Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten.\n2"
#code attempt
gsub("\\n\\n?[0-9]?", '', string)
This will output the following string:
"There is a square in the apartment. Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten."
CodePudding user response:
Writing the pattern like this, [\r\\n0-9], matches either a carriage return, one of the chars \ or n, or a digit 0-9.
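The bracket behaviour described above can be checked directly. This is a small sketch, assuming R's default (TRE) engine treats a backslash inside [...] as a literal character while perl = TRUE (PCRE) interprets the escape; the input "banana\n" and the variable names are made up for illustration.

```r
x <- "banana\n"

# Default engine: "[\\n]" is the regex [\n], read as "backslash or n",
# so the letter n is removed and the newline survives
default_result <- gsub("[\\n]", "", x)

# PCRE: the same class is read as "newline", so only the newline is removed
perl_result <- gsub("[\\n]", "", x, perl = TRUE)

print(default_result)
print(perl_result)
```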
You could write the pattern matching 1 or more carriage returns or newlines, followed by optional digits:
[\r\n]+[0-9]*
Example:
string <- "There is a square in the apartment. \n\n4Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten.\n2"
gsub("[\r\n]+[0-9]*", '', string)
Output
[1] "There is a square in the apartment. Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten."
See an R demo.
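If the same cleanup is needed on several strings, the pattern can be wrapped in a small function. The sketch below is one possible way to do that, not the answer's own code: the name strip_numbered_newlines and the second sample string are hypothetical, and it assumes perl = TRUE so that the escapes \r and \n can be written inside the brackets.

```r
# Hypothetical helper: drop runs of CR/LF plus any digits glued to them
strip_numbered_newlines <- function(x) {
  # perl = TRUE so that \r and \n are interpreted inside the brackets
  gsub("[\\r\\n]+[0-9]*", "", x, perl = TRUE)
}

samples <- c(
  "There is a square in the apartment. \n\n4Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten.\n2",
  "line one\r\n\r\n12line two"
)

cleaned <- strip_numbered_newlines(samples)
print(cleaned)
```

Because gsub() is vectorised, the helper accepts a whole character vector; digits that do not directly follow a line break are left alone.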