Clean duplicate phone numbers in R dataframe column-CodePudding

We have a dataframe with a Phone column that has phone numbers, however phone numbers are duplicated in many of the columns:

structure(list(Title = c("Head Coach", "Athletic Trainer", "Head Coach", 
                         "Assistant Coach", "Student Assistant", "Head Men's Basketball Coach", "Coach"
), Phone = c("(904) 256-7242\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t(904) 256-7242", 
             "256-765-5020\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t256-765-5020", 
             NA, "765.285.8142\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t765.285.8142", 
             "", "549-5849\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t549-5849", "516-302-1039"
)), row.names = c(1L,2L, 3L,4L,5L,6L,7L ), class = "data.frame")

                        Title                                                                              Phone
1                  Head Coach (904) 256-7242\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t(904) 256-7242
2            Athletic Trainer     256-765-5020\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t256-765-5020
3                  Head Coach                                                                               <NA>
4             Assistant Coach     765.285.8142\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t765.285.8142
5           Student Assistant                                                                                   
6 Head Men's Basketball Coach             549-5849\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t549-5849
7                       Coach                                                                       516-302-1039

The correct output would remove phone number duplicates:

structure(list(Title = c("Head Coach", "Athletic Trainer", "Head Coach", 
                         "Assistant Coach", "Student Assistant", "Head Men's Basketball Coach", "Coach"
), Phone = c("(904) 256-7242", 
             "256-765-5020", 
             NA, "765.285.8142", 
             "", "549-5849", "516-302-1039"
)), row.names = c(1L,2L, 3L,4L,5L,6L,7L ), class = "data.frame")

Typically I would share our progress on this, but quite frankly we are lost as to how to even get started on this. Seems like a very difficult problem especially given (a) the \r\n\t\t\t\ that appear in the strings, (b) that there are NA and missing values and (c) not every row is duplicated, (d) different formats (some area codes, some with ., some with -, some with ()). Any recommendations on how to clean this column?

CodePudding user response：

 df$Phone = sub('\r.*', '', df$Phone)

                        Title          Phone
1                  Head Coach (904) 256-7242
2            Athletic Trainer   256-765-5020
3                  Head Coach           <NA>
4             Assistant Coach   765.285.8142
5           Student Assistant               
6 Head Men's Basketball Coach       549-5849
7                       Coach   516-302-1039

CodePudding user response：

We could remove the whitespace with gsub, split at the delimiter created (,) and extract the first element

df1$Phone <- sapply(strsplit(gsub("[\r\n\t] ", ",", df1$Phone), ","), \(x) x[1])

-output

df1$Phone
[1] "(904) 256-7242" "256-765-5020"   NA   
[4] "765.285.8142"   NA               "549-5849"       "516-302-1039"

Or another option is trimws - specify the whitespace to match the one or more [\r\n\t] followed by other characters (.*)

trimws(df1$Phone, whitespace = "[\r\n\t] .*")
[1] "(904) 256-7242" "256-765-5020"   NA  
[4] "765.285.8142"   ""               "549-5849"       "516-302-1039"