We have a dataframe with a Phone
column that has phone numbers, however phone numbers are duplicated in many of the columns:
structure(list(Title = c("Head Coach", "Athletic Trainer", "Head Coach",
"Assistant Coach", "Student Assistant", "Head Men's Basketball Coach", "Coach"
), Phone = c("(904) 256-7242\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t(904) 256-7242",
"256-765-5020\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t256-765-5020",
NA, "765.285.8142\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t765.285.8142",
"", "549-5849\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t549-5849", "516-302-1039"
)), row.names = c(1L,2L, 3L,4L,5L,6L,7L ), class = "data.frame")
Title Phone
1 Head Coach (904) 256-7242\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t(904) 256-7242
2 Athletic Trainer 256-765-5020\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t256-765-5020
3 Head Coach <NA>
4 Assistant Coach 765.285.8142\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t765.285.8142
5 Student Assistant
6 Head Men's Basketball Coach 549-5849\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t549-5849
7 Coach 516-302-1039
The correct output would remove phone number duplicates:
structure(list(Title = c("Head Coach", "Athletic Trainer", "Head Coach",
"Assistant Coach", "Student Assistant", "Head Men's Basketball Coach", "Coach"
), Phone = c("(904) 256-7242",
"256-765-5020",
NA, "765.285.8142",
"", "549-5849", "516-302-1039"
)), row.names = c(1L,2L, 3L,4L,5L,6L,7L ), class = "data.frame")
Typically I would share our progress on this, but quite frankly we are lost as to how to even get started on this. Seems like a very difficult problem especially given (a) the \r\n\t\t\t\
that appear in the strings, (b) that there are NA and missing values and (c) not every row is duplicated, (d) different formats (some area codes, some with ., some with -, some with ()). Any recommendations on how to clean this column?
CodePudding user response:
df$Phone = sub('\r.*', '', df$Phone)
Title Phone
1 Head Coach (904) 256-7242
2 Athletic Trainer 256-765-5020
3 Head Coach <NA>
4 Assistant Coach 765.285.8142
5 Student Assistant
6 Head Men's Basketball Coach 549-5849
7 Coach 516-302-1039
CodePudding user response:
We could remove the whitespace with gsub
, split at the delimiter created (,
) and extract the first element
df1$Phone <- sapply(strsplit(gsub("[\r\n\t] ", ",", df1$Phone), ","), \(x) x[1])
-output
df1$Phone
[1] "(904) 256-7242" "256-765-5020" NA
[4] "765.285.8142" NA "549-5849" "516-302-1039"
Or another option is trimws
- specify the whitespace
to match the one or more [\r\n\t]
followed by other characters (.*
)
trimws(df1$Phone, whitespace = "[\r\n\t] .*")
[1] "(904) 256-7242" "256-765-5020" NA
[4] "765.285.8142" "" "549-5849" "516-302-1039"