I am working with data from the Twitter API, and wherever users included emojis in their name field, these have been converted to Unicode string representations in my dataframe. The structure of my data is roughly like this:
user_profiles <- as.data.frame(c("Susanne Bold", "Julian K. Peard <U+0001F41C>",
                                 "<U+0001F30A> Alexander K Miller <U+0001F30A>", "John Mason"))
colnames(user_profiles) <- "name"
which looks like this:
                                          name
1                                 Susanne Bold
2                 Julian K. Peard <U+0001F41C>
3 <U+0001F30A> Alexander K Miller <U+0001F30A>
4                                   John Mason
I am now trying to isolate the actual name into a new column using regexp:
user_profiles <- user_profiles %>%
  mutate(clean_name = str_remove_all(name, "\\<U\\+[[:alnum:]]\\>[ ]?"))
But this expression 1. seems rather complicated and 2. doesn't match the pattern. I have already tried multiple variations of the regexp. Weirdly enough, grepl is able to detect the pattern with this version (which str_remove_all doesn't accept, since it is missing a closing bracket):
grepl("\\<U\\+[[:alnum:]\\>[ ]?", user_profiles$name)
[1] FALSE TRUE TRUE FALSE
# note that the second bracket around alnum is left opened
Can somebody explain this or offer an easier solution?
Thanks a lot!
CodePudding user response:
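Here is an alternative way to do it:
library(dplyr)
library(tidyr)
library(stringr)  # needed for str_detect()
user_profiles %>%
  separate_rows(name, sep = '<|>') %>%     # split each name at the angle brackets
  filter(!str_detect(name, '^U\\+')) %>%   # drop the pieces that hold a U+XXXX code
  mutate(name = na_if(name, "")) %>%       # turn empty pieces into NA
  na.omit()                                # and drop them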
name
<chr>
1 "Susanne Bold"
2 "Julian K. Peard "
3 " Alexander K Miller "
4 "John Mason"
CodePudding user response:
The first str_remove_all does not work because you missed the + quantifier after the alphanumeric pattern.
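To illustrate the effect of the quantifier, here is a minimal base R sketch. The variable names are made up for this example, and grepl is used instead of stringr only so that it runs without extra packages:

```r
# One sample value in the form shown in the question
x <- "Julian K. Peard <U+0001F41C>"

# Without a quantifier, [[:alnum:]] must match exactly one character
# between "U+" and ">", so the eight-character code is not matched
without_quantifier <- grepl("<U\\+[[:alnum:]]>", x)

# With "+", [[:alnum:]]+ matches the whole run of characters
with_quantifier <- grepl("<U\\+[[:alnum:]]+>", x)

print(c(without_quantifier, with_quantifier))
```

Only the second pattern should report a match, which is the behaviour the fix below relies on.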
You can use
user_profiles <- user_profiles %>%
  mutate(clean_name = str_remove_all(name, "<U\\+[[:xdigit:]]+>\\s*"))
Do not escape < and >. On their own they are not special characters in regex, and in TRE regex (used by the base regex functions without perl=TRUE), \< and \> are word boundaries.
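The difference between the two base R engines can be checked with a small sketch like the one below. The sample words are invented for illustration, and the PCRE behaviour shown (an escaped < treated as a literal character) is my understanding of how perl = TRUE handles it:

```r
# Hypothetical sample strings, not taken from the question
words <- c("cat", "concat", "<cat")

# Default (TRE) engine: "\\<" is a start-of-word boundary,
# so "cat" must begin a word
tre_hits <- grepl("\\<cat", words)

# perl = TRUE (PCRE): "\\<" is read as a literal "<",
# so only strings containing "<cat" match
pcre_hits <- grepl("\\<cat", words, perl = TRUE)

print(tre_hits)
print(pcre_hits)
```

This would also explain why the grepl call in the question reported matches: with the default engine, the escaped < acts as a boundary in front of U rather than as a literal bracket.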
Pattern details
- <U - a <U string
- \+ - a literal +
- [[:xdigit:]]+ - one or more hex chars
- > - a > char
- \s* - zero or more whitespaces.
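As a rough check of the pattern described above, the same regex can be applied with base R's gsub. This is only a sketch: the vector name and the extra trimws step (to remove the space left behind by a trailing code) are additions for illustration and are not part of the answer's stringr call:

```r
# Sample names mirroring the data in the question
name <- c("Susanne Bold",
          "Julian K. Peard <U+0001F41C>",
          "<U+0001F30A> Alexander K Miller <U+0001F30A>",
          "John Mason")

# Remove each <U+XXXX> code together with any whitespace after it,
# then trim what is left at the ends of the string
clean_name <- trimws(gsub("<U\\+[[:xdigit:]]+>\\s*", "", name))

print(clean_name)
```

Because the pattern only consumes whitespace after the code, a code at the end of a name leaves a trailing space, which is why the sketch trims the result.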
CodePudding user response:
We can add the one-or-more quantifier (+) after the [[:alnum:]]:
library(dplyr)
library(stringr)
user_profiles <- user_profiles %>%
  mutate(clean_name = str_remove_all(name, "\\s*\\<U\\+[[:alnum:]]+\\>\\s*"))
-output
user_profiles
                                          name         clean_name
1                                 Susanne Bold       Susanne Bold
2                 Julian K. Peard <U+0001F41C>    Julian K. Peard
3 <U+0001F30A> Alexander K Miller <U+0001F30A> Alexander K Miller
4                                   John Mason         John Mason