Trimming everything down to alphabetic characters (R)-CodePudding

I've been looking for a relatively simple command that strips non-alphabetic characters from the beginning and end of a string. It is important that other characters, e.g. digits, can remain inside the text.

Let me give you an example:

a <- c("1) dog with 4 legs", "- cat with 1 tail", "2./ bird with 2 wings." )
b <- c("07 mouse with 1 tail.", "2.pig with 1 nose,,", "$ cow with 4 spots_")
data <- data.frame(cbind(a, b))

The proper outcome would be this:

a <- c("dog with 4 legs", "cat with 1 tail", "bird with 2 wings" )
b <- c("mouse with 1 tail", "pig with 1 nose", "cow with 4 spots")
data_cleaned <- data.frame(cbind(a, b))

Is there a simple solution?

CodePudding user response:

We could do something like this:

First we replace every punctuation character with a space. Then we remove everything up to and including the first space:

library(dplyr)
library(stringr)

data %>% 
  mutate(across(c(a,b), ~str_replace_all(., "[[:punct:]]", " ")),
         across(c(a,b), ~str_replace(., "^\\S* ", "")))

                     a                  b
1      dog with 4 legs mouse with 1 tail 
2      cat with 1 tail  pig with 1 nose  
3   bird with 2 wings   cow with 4 spots
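Note that the printed result above still carries some leading and trailing spaces where punctuation used to be. As a rough alternative sketch in base R (not the answerer's method), a single anchored regular expression can strip every non-letter at either end in one pass. The helper name `trim_non_alpha` is hypothetical and introduced only for illustration.

```r
a <- c("1) dog with 4 legs", "- cat with 1 tail", "2./ bird with 2 wings.")
b <- c("07 mouse with 1 tail.", "2.pig with 1 nose,,", "$ cow with 4 spots_")
data <- data.frame(a, b, stringsAsFactors = FALSE)

# Hypothetical helper: remove runs of non-letters anchored at the start (^)
# or at the end ($) of each string; characters in the middle are untouched.
trim_non_alpha <- function(x) gsub("^[^[:alpha:]]+|[^[:alpha:]]+$", "", x)

data_cleaned <- data
data_cleaned[] <- lapply(data, trim_non_alpha)  # data_cleaned[] keeps the data frame shape
data_cleaned
```

Because the pattern is anchored on both sides, digits inside the text ("4 legs", "1 tail") are left alone.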

CodePudding user response:

We may also use

data[] <- trimws(as.matrix(data), whitespace = "[[:punct:]0-9 ]+")

-output

> data
                  a                 b
1   dog with 4 legs mouse with 1 tail
2   cat with 1 tail   pig with 1 nose
3 bird with 2 wings  cow with 4 spots
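To see why a character class is enough here, it may help to spell out roughly what `trimws()` does with the `whitespace` argument (available from R 3.6.0): as far as the documentation describes it, the function anchors the pattern at each end, appends `+`, and calls `sub(..., perl = TRUE)`. The sketch below reproduces that by hand on a few of the example strings; the variable names `cls`, `left_trimmed` and `manual` are made up for illustration.

```r
x <- c("1) dog with 4 legs", "2.pig with 1 nose,,", "$ cow with 4 spots_")

# One "trimmable" character: punctuation, a digit, or a space.
cls <- "[[:punct:]0-9 ]"

# Strip a run of such characters from the start, then from the end.
left_trimmed <- sub(paste0("^", cls, "+"), "", x, perl = TRUE)
manual       <- sub(paste0(cls, "+$"), "", left_trimmed, perl = TRUE)

# Compare with trimws() using the same class as `whitespace`.
identical(manual, trimws(x, whitespace = cls))
```

Since `trimws()` adds the quantifier itself, passing the bare class and passing the class followed by `+` should give the same result.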

CodePudding user response:

You could use trimws():

data[1:2] <- lapply(data[1:2], trimws, whitespace = "[^A-Za-z]+")
data

#                   a                 b
# 1   dog with 4 legs mouse with 1 tail
# 2   cat with 1 tail   pig with 1 nose
# 3 bird with 2 wings  cow with 4 spots

Its dplyr equivalent is

library(dplyr)

data %>%
  mutate(across(a:b, ~ trimws(.x, whitespace = "[^A-Za-z]+")))
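Both versions name the columns explicitly. If the data frame also held non-text columns, one possible base R sketch is to pick out the character columns first and trim only those. The `id` column and the `is_chr` variable are hypothetical additions for this illustration, and the `whitespace` argument assumes R 3.6.0 or later.

```r
data <- data.frame(
  id = 1:3,  # hypothetical numeric column, not part of the original example
  a  = c("1) dog with 4 legs", "- cat with 1 tail", "2./ bird with 2 wings."),
  b  = c("07 mouse with 1 tail.", "2.pig with 1 nose,,", "$ cow with 4 spots_"),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are character vectors.
is_chr <- vapply(data, is.character, logical(1))

# Trim non-letters from both ends of the character columns only.
data[is_chr] <- lapply(data[is_chr], trimws, whitespace = "[^A-Za-z]")
data
```

The numeric `id` column is skipped, so it is neither coerced to character nor trimmed.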