I have a messy character variable like:
df<-c("_oun_", "0000ff", "03815", "?3jhdb", "test", "1,000", "1.000")
and I would like to filter out all values that are not words. I thought a start would be to filter out all values not starting with a character.
How can I do this with tidyverse? For the above-mentioned example, the desired output would be "test".
Answer:
Some options with stringr. The regex matches values that start (^) with a letter ([:alpha:], upper or lower case) followed by any number of further letters (+).
This returns the matching values directly, with no need to subset the data manually:
library(stringr)
str_subset(df, "^[:alpha:]+")
[1] "test"
With manual subsetting, using a logical index:
df[str_detect(df, "^[:alpha:]+")]
[1] "test"
or using the positions of the matches:
df[str_which(df, "^[:alpha:]+")]
[1] "test"
To keep the vector's length and positions intact (non-matching values become NA):
str_extract(df, "^[:alpha:]+")
[1] NA NA NA NA "test" NA NA
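One caveat: the pattern above only checks how a value starts, so a value that begins with letters and continues with other characters would also be kept. If "word" should mean letters only, anchoring the end of the string with $ is one way to tighten it. A minimal sketch, assuming that stricter definition is what is wanted; "abc123" is a hypothetical extra value added for illustration and is not part of the original data:

```r
library(stringr)

df <- c("_oun_", "0000ff", "03815", "?3jhdb", "test", "1,000", "1.000")

# Hypothetical extra value (not in the question) to show the difference
x <- c(df, "abc123")

# Prefix match: keeps any value that starts with one or more letters
starts_with_letters <- str_subset(x, "^[:alpha:]+")

# Whole-string match: keeps only values consisting entirely of letters
only_letters <- str_subset(x, "^[:alpha:]+$")

print(starts_with_letters)
print(only_letters)
```

On the original vector both patterns give the same result, so the choice only matters if the real data contains mixed values.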