I have the following data in R:
id = 1:5
col1 = c("john", "henry", "adam", "jenna", "Phone: 222 2222")
col2 = c("river B8C 9L4", "Field U9H 5E2 PP", "NA", "ocean A1B 5H1 dd", "dave")
col3 = c("Phone: 111 1111 111", "steve", "forest K0Y 1U9 hu2", "NA", "NA")
col4 = c("matt", "peter", "Phone: 333 333 1113", "Phone: 444 111 1153", "kevin")
my_data = data.frame(id, col1, col2, col3, col4)
id col1 col2 col3 col4
1 1 john river B8C 9L4 Phone: 111 1111 111 matt
2 2 henry Field U9H 5E2 PP steve peter
3 3 adam NA forest K0Y 1U9 hu2 Phone: 333 333 1113
4 4 jenna ocean A1B 5H1 dd NA Phone: 444 111 1153
5 5 Phone: 222 2222 dave NA kevin
I am trying to accomplish the following task - I would like to create a new dataset with the following columns. For each row:
- Step 1: id (trivial, this is always the first column)
- Step 2: A column with the phone number
- Step 3: A column that satisfies the following condition
'(([A-Z] ?[0-9]){3})|.', '\\1'
- Step 4: Once Step 1 - Step 3 has been completed, I would like to combine all names into a single column
Here is a sample of the desired output:
id name address phone
1 1 john matt river B8C 9L4 Phone: 111 1111 111
2 2 henry steve peter Field U9H 5E2 PP NA
3 3 adam forest K0Y 1U9 hu2 Phone: 333 333 1113
4 4 jenna ocean A1B 5H1 dd Phone: 444 111 1153
5 5 dave kevin NA Phone: 222 2222
Here is the code I have written:
my_data$col1[grep("Phone", my_data$col1)]
my_data$col2[grep("Phone", my_data$col2)]
my_data$col3[grep("Phone", my_data$col3)]
my_data$col4[grep("Phone", my_data$col4)]
my_data$col1[grep( '(([A-Z] ?[0-9]){3})|.', '\\1' , my_data$col1)]
my_data$col2[grep('(([A-Z] ?[0-9]){3})|.', '\\1', my_data$col2)]
my_data$col3[grep('(([A-Z] ?[0-9]){3})|.', '\\1', my_data$col3)]
my_data$col4[grep('(([A-Z] ?[0-9]){3})|.', '\\1', my_data$col4)]
Based on the above code, I was thinking on identifying which of the columns meet the condition in each step, and then using the COLASCE statement in dplyr to create the final dataset. But I think this might be a very long way of accomplishing this problem.
Can someone please suggest a faster way to solve this problem?
Thanks!
CodePudding user response:
Try this:
tmp <- apply(my_data[,-1], 1, function(z) { z <- z[!is.na(z) & z != "NA"]; ind <- dplyr::case_when(grepl("^Phone:", z) ~ "phone", grepl("(([A-Z] ?[0-9]){3})", z) ~ "address", TRUE ~ "name"); data.frame(lapply(split(z, ind), paste, collapse = " ")); })
tmp
# [[1]]
# address name phone
# 1 river B8C 9L4 john matt Phone: 111 1111 111
# [[2]]
# address name
# 1 Field U9H 5E2 PP henry steve peter
# [[3]]
# address name phone
# 1 forest K0Y 1U9 hu2 adam Phone: 333 333 1113
# [[4]]
# address name phone
# 1 ocean A1B 5H1 dd jenna Phone: 444 111 1153
# [[5]]
# name phone
# 1 dave kevin Phone: 222 2222
With this, we can combine them (I prefer dplyr::bind_rows
or data.table::rbindlist
).
dplyr::bind_rows(tmp)
# address name phone
# 1 river B8C 9L4 john matt Phone: 111 1111 111
# 2 Field U9H 5E2 PP henry steve peter <NA>
# 3 forest K0Y 1U9 hu2 adam Phone: 333 333 1113
# 4 ocean A1B 5H1 dd jenna Phone: 444 111 1153
# 5 <NA> dave kevin Phone: 222 2222
cbind(my_data[,1,drop=FALSE], dplyr::bind_rows(tmp))
# id address name phone
# 1 1 river B8C 9L4 john matt Phone: 111 1111 111
# 2 2 Field U9H 5E2 PP henry steve peter <NA>
# 3 3 forest K0Y 1U9 hu2 adam Phone: 333 333 1113
# 4 4 ocean A1B 5H1 dd jenna Phone: 444 111 1153
# 5 5 <NA> dave kevin Phone: 222 2222
CodePudding user response:
One solution based on grep
setNames(data.frame(my_data$id,
sapply(c("^[a-z] [a-z] $", "^[^P] .*[[:digit:]] ", "Phone"), function(srch)
apply(my_data[,-1], 1, function(x)
paste(grep(srch, x, value=T), collapse=" ")))),
c("Id", "Name", "Address", "Phone"))
Id Name Address Phone
1 1 john matt river B8C 9L4 Phone: 111 1111 111
2 2 henry steve peter Field U9H 5E2 PP
3 3 adam forest K0Y 1U9 hu2 Phone: 333 333 1113
4 4 jenna ocean A1B 5H1 dd Phone: 444 111 1153
5 5 dave kevin Phone: 222 2222