I have a data frame in R with a 'NAMES' column that contains some special characters, some more obvious than others.
Input
NAMES
�OS� M�REN�
P*TE* CAR** **
#LEX ##OPPS
Desired output
list of values that represent the 'special characters'
[#, *, �, ...]
I am currently flagging which rows contain these characters with the following code, but what I actually want is to identify the characters themselves and build a new list of the values that are non-ASCII or otherwise special.
Code
library(dplyr)
# Flag rows whose NAMES value contains anything other than a Unicode letter or a space
df %>% mutate(
  has_non_letters = grepl("[^\\p{L} ]", NAMES, perl = TRUE)
)
Answer:
Base R approach:
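# Split every name into single characters and keep the distinct ones
x <- unique(unlist(strsplit(df$NAMES, "")))
x <- x[x != " "]
# Blank out digits, ASCII letters, "/", "'" and spaces, then drop the empty strings
x <- gsub("[0-9A-Za-z/' ]", "", x, ignore.case = TRUE)
x <- x[x != ""]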
x
[1] "�" "*" "#"
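First answer: for this example, we could remove the capital letters from each name and look at what is left (this gives one string per row, with the spaces kept, rather than a list of distinct characters):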
library(dplyr)
library(stringr)
df %>%
  mutate(x = str_remove_all(NAMES, '[A-Z]')) %>%
  pull(x)
[1] "�� ��" "** ** **" "# ##"
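If a single vector of distinct characters is wanted, and the non-ASCII ones need to be separated from the rest, the following is a minimal base R sketch. It compares code points instead of relying on a regex character class. The sample data frame is an assumption: it reuses the three names from the question and takes the unreadable characters to be U+FFFD (the Unicode replacement character), which may not be what the real data contains. The names `chars`, `special_chars` and `non_ascii` are only illustrative.

```r
# Hypothetical sample data; "\ufffd" stands in for the unreadable characters
df <- data.frame(
  NAMES = c("\ufffdOS\ufffd M\ufffdREN\ufffd", "P*TE* CAR** **", "#LEX ##OPPS"),
  stringsAsFactors = FALSE
)

# Every distinct character that occurs anywhere in the column
chars <- unique(unlist(strsplit(df$NAMES, "")))

# Anything that is not an ASCII letter, a digit or a space counts as "special"
allowed <- c(LETTERS, letters, as.character(0:9), " ")
special_chars <- setdiff(chars, allowed)

# Of all characters, the strictly non-ASCII ones have a code point above 127
non_ascii <- chars[vapply(chars, function(ch) utf8ToInt(ch) > 127L, logical(1))]

special_chars
non_ascii
```

Under these assumptions, `special_chars` holds the replacement character, `*` and `#`, while `non_ascii` holds only the replacement character. Adjust `allowed` if characters such as `/` or `'` should also be treated as ordinary.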