I would appreciate some help with removing/replacing the surrounding square brackets, inner quotes and slashes in character data in R, preferably using dplyr.
Sample:
df <- c("['Mamie Smith']", "[\"Screamin' Jay Hawkins\"]")
What I have tried:
gsub("[[]]", "", df) # Throws error
df %>%
str_replace("[[]]", "") # Also throws error
What the data should look like:
"Mamie Smith", "Screamin' Jay Hawkins"
Would love your assistance.
CodePudding user response:
In base R we can make use of the trimws function.
If we are not interested in the non-word characters at either end:
trimws(df, whitespace = "\\W+")
[1] "Mamie Smith" "Screamin' Jay Hawkins"
But if we are only interested in deleting square brackets and quotes while leaving other punctuation, spaces, etc., then:
trimws(df, whitespace = "[\\]\\[\"']+")
[1] "Mamie Smith" "Screamin' Jay Hawkins"
CodePudding user response:
Base R:
sapply(regmatches(df, regexec('(\\w.*)(.*\\w)', df)), "[", 1)
[1] "Mamie Smith" "Screamin' Jay Hawkins"
OR
We could use str_extract from the stringr package with this regex:
library(stringr)
str_extract(df, '(\\w.*)(.*\\w)')
[1] "Mamie Smith" "Screamin' Jay Hawkins"
CodePudding user response:
To pair up the square brackets with the accompanying type of quote, you can use:
\[(["'])(.*?)\1]
Explanation
\[ Match [
(["']) Capture group 1, capture either " or '
(.*?) Capture group 2, match as few characters as possible
\1 Backreference to group 1 to match the same type of quote
] Match ]
In the replacement use the value of capture group 2 using \\2
df <- c("['Mamie Smith']", "[\"Screamin' Jay Hawkins\"]")
gsub("\\[([\"'])(.*?)\\1]", "\\2", df)
Output
[1] "Mamie Smith" "Screamin' Jay Hawkins"
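As a rough illustration of why the backreference matters, the sketch below adds a third, hypothetical input whose opening and closing quotes differ. It is not part of the original data, and perl = TRUE is an assumption made here so the lazy quantifier and backreference are handled by PCRE. The mismatched entry is left untouched because \\1 cannot find the same quote type before the closing bracket.

```r
# Two entries from the question plus one hypothetical entry with mismatched quotes
x <- c("['Mamie Smith']", "[\"Screamin' Jay Hawkins\"]", "['Mixed\"]")

# Same pattern as above, with the closing bracket escaped for clarity
pattern <- "\\[([\"'])(.*?)\\1\\]"

# Replace each full match with capture group 2 (the text between the quotes)
res <- gsub(pattern, "\\2", x, perl = TRUE)
print(res)
```

Only the first two elements are unwrapped; the third keeps its brackets and quotes.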
CodePudding user response:
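Since [ and ] are regex metacharacters, you need to escape them with a double backslash \\ (the " only needs escaping when the R string itself is delimited by double quotes).
Here's some alternative code:
gsub('\\"|\\[|\\]', "", df)
Note that this removes only the brackets and double quotes, so the single quotes around Mamie Smith remain.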
CodePudding user response:
Another, relatively easy, regex solution is this:
library(dplyr)
data.frame(df) %>%
mutate(df = gsub("\\[\\W+|\\W+\\]", "", df))
df
1 Mamie Smith
2 Screamin' Jay Hawkins
Here we remove any non-alphanumeric character (\\W) occurring one or more times on the condition that it be preceded OR (|) followed by a square bracket.
Alternatively, to borrow from @TaerJae but greatly simplified:
library(stringr)
data.frame(df) %>%
mutate(df = str_extract(df, '\\w.*\\w'))
Here we simply focus on the alphanumeric characters (\\w) on either side of the string, while allowing for any characters (.*) to occur in between them, thus capturing, for example, the apostrophe in Screamin' and the whitespaces.
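One caveat worth checking with the '\\w.*\\w' approach: it also drops any legitimate punctuation at the very start or end of a name. The sketch below uses a hypothetical entry ending in an exclamation mark (not part of the original data) and compares the base R equivalent of the extraction with a trimws call restricted to brackets and quotes. The use of perl = TRUE is an assumption made here for consistency.

```r
# Hypothetical entry whose name ends in punctuation
y <- "['Wham!']"

# Base R equivalent of str_extract(y, '\\w.*\\w'): first match of the pattern
extracted <- regmatches(y, regexpr("\\w.*\\w", y, perl = TRUE))

# Trimming only brackets and quotes from both ends keeps the "!"
trimmed <- trimws(y, which = "both", whitespace = "[\\]\\[\"']")

print(extracted)
print(trimmed)
```

If names can end in punctuation, the restricted trimws variant is probably the safer choice.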
CodePudding user response:
When looking for ] inside [], it needs to be in first place, as in []], or be escaped in other places. The quote character used to delimit the string needs to be escaped when used inside it, as in "[\"]"; alternatively, delimit the string with the other quote type, as in '["]'. The example strings contain no slashes (the backslashes there only escape ").
gsub("[]['\"]", "", df)
#[1] "Mamie Smith" "Screamin Jay Hawkins"
Another option, avoiding escaping " or ', is to use raw character constants r"(...)".
gsub(r"([]["'])", "", df)
#[1] "Mamie Smith" "Screamin Jay Hawkins"
To limit the search to the borders, ^ (begin) and $ (end) need to be given.
gsub("^[]['\"]*|[]['\"]*$", "", df)
#[1] "Mamie Smith" "Screamin' Jay Hawkins"
or trimws could be used.
trimws(df, "both", "[]['\"]")
#[1] "Mamie Smith" "Screamin' Jay Hawkins"
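Since the question mentions working with a data frame, here is a minimal base R sketch of applying the same trimws call to a column. The data frame artists and its column name are hypothetical names chosen for illustration; the whitespace argument of trimws assumes R 3.6.0 or later.

```r
# Hypothetical data frame holding the sample strings in a column called "name"
artists <- data.frame(
  name = c("['Mamie Smith']", "[\"Screamin' Jay Hawkins\"]"),
  stringsAsFactors = FALSE
)

# Strip brackets and quotes from both ends only; inner apostrophes are kept
artists$name <- trimws(artists$name, which = "both", whitespace = "[]['\"]")
print(artists)
```

The same expression could be placed inside a dplyr mutate() call if that style is preferred.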