Identification of special characters in a string using R

Time:07-02

I have a data field consisting of firm names that may contain special characters such as @, /, and -. I need to identify whether each firm name contains any special characters. I have tried the suggestions from "r check if string contains special characters", "How do I deal with special characters like \^$.?*|+()[{ in my regex?", and "R, check if special character in string", but none of them gives the correct results.

The last two firm names should give FALSE in the check field, but none of the three approaches yields the right result. Please suggest how to correct my code. Thanks.

    library(stringr)  # needed for str_count()

    df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
                     Firm = c("Xi'an Feibao Technology Co Ltd", "A&B PVT LTD",
                              "Wonik Pne Co Ltd/Old", "Wooree E&L Co Ltd",
                              "X-Fab Silicon Foundries SE", "Yongsan S&C", "T-Gaia Corp",
                              "Suntech Co Ltd/Seoul", "IBM", "31 Inc"))
    
    df$nwords <- str_count(df$Firm, "\\w+")
    
    df$check1 <- grepl('[^[:alnum:]]', df$Firm)
    
    df$check2 <- grepl('[^[:punct:]]', df$Firm)
    
    pattern <- "/|:|\\?|<|>|\\|\\\\|\\|-|&|'|*"
    df$check3 <- grepl(pattern, df$Firm)

    > print(df)
       ID                           Firm nwords check1 check2 check3
    1   1 Xi'an Feibao Technology Co Ltd      6   TRUE   TRUE   TRUE
    2   2                    A&B PVT LTD      4   TRUE   TRUE   TRUE
    3   3           Wonik Pne Co Ltd/Old      5   TRUE   TRUE   TRUE
    4   4              Wooree E&L Co Ltd      5   TRUE   TRUE   TRUE
    5   5     X-Fab Silicon Foundries SE      5   TRUE   TRUE   TRUE
    6   6                    Yongsan S&C      3   TRUE   TRUE   TRUE
    7   7                    T-Gaia Corp      3   TRUE   TRUE   TRUE
    8   8           Suntech Co Ltd/Seoul      4   TRUE   TRUE   TRUE
    9   9                            IBM      1  FALSE   TRUE   TRUE
    10 10                         31 Inc      2   TRUE   TRUE   TRUE

CodePudding user response:

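This seems to work, because [[:punct:]] matches only punctuation characters, whereas [^[:alnum:]] also matches the spaces between words and [^[:punct:]] matches any character that is not punctuation: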

    grepl('[[:punct:]]', df$Firm)
    #[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
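
For comparison, here is a minimal sketch of two other ways to write the check in base R, using the firm names from the question. The vector names firms, has_special and has_listed are made up for illustration, and the character list in Option B is only an assumed example of what might count as "special"; adjust it to your own definition.

```r
firms <- c("Xi'an Feibao Technology Co Ltd", "A&B PVT LTD", "Wonik Pne Co Ltd/Old",
           "Wooree E&L Co Ltd", "X-Fab Silicon Foundries SE", "Yongsan S&C",
           "T-Gaia Corp", "Suntech Co Ltd/Seoul", "IBM", "31 Inc")

# Option A: flag anything that is neither alphanumeric nor whitespace.
# Adding [:space:] to the negated class keeps "31 Inc" from being flagged.
has_special <- grepl("[^[:alnum:][:space:]]", firms)

# Option B: flag only an explicit set of characters (assumed list, for illustration).
# Inside a bracket expression these characters are literal; "-" is placed last
# so that it is not read as a range.
has_listed <- grepl("[/:?<>|&'*-]", firms)

print(has_special)
print(has_listed)
```

On this data both vectors should be TRUE for the first eight names and FALSE for "IBM" and "31 Inc". Option B also avoids the problem with alternation patterns like the one in check3, where a bare * at the end can end up matching every string.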