Home > Software engineering >  R: Selecting Multiple Columns Based on Conditions
R: Selecting Multiple Columns Based on Conditions

Time:11-19

I have the following data in R:

id = 1:5
col1 = c("john", "henry", "adam", "jenna", "Phone: 222 2222")
col2 = c("river B8C 9L4", "Field U9H 5E2 PP", "NA", "ocean A1B 5H1 dd", "dave")
col3 = c("Phone: 111 1111 111", "steve", "forest K0Y 1U9 hu2", "NA", "NA")
col4 = c("matt", "peter", "Phone: 333 333 1113", "Phone: 444 111 1153", "kevin")
my_data = data.frame(id, col1, col2, col3, col4)

      id            col1             col2                col3                col4
1  1            john    river B8C 9L4 Phone: 111 1111 111                matt
2  2           henry Field U9H 5E2 PP               steve               peter
3  3            adam               NA  forest K0Y 1U9 hu2 Phone: 333 333 1113
4  4           jenna ocean A1B 5H1 dd                  NA Phone: 444 111 1153
5  5 Phone: 222 2222             dave                  NA               kevin

I am trying to accomplish the following task - I would like to create a new dataset with the following columns. For each row:

  • Step 1: id (trivial, this is always the first column)
  • Step 2: A column with the phone number
  • Step 3: A column that satisfies the following condition '(([A-Z] ?[0-9]){3})|.', '\\1'
  • Step 4: Once Step 1 - Step 3 has been completed, I would like to combine all names into a single column

Here is a sample of the desired output:

  id              name             address                phone
1  1         john matt       river B8C 9L4  Phone: 111 1111 111
2  2 henry steve peter    Field U9H 5E2 PP                   NA
3  3              adam  forest K0Y 1U9 hu2  Phone: 333 333 1113
4  4             jenna  ocean A1B 5H1 dd    Phone: 444 111 1153
5  5        dave kevin                  NA      Phone: 222 2222

Here is the code I have written:

my_data$col1[grep("Phone", my_data$col1)]
my_data$col2[grep("Phone", my_data$col2)]
my_data$col3[grep("Phone", my_data$col3)]
my_data$col4[grep("Phone", my_data$col4)]

my_data$col1[grep( '(([A-Z] ?[0-9]){3})|.', '\\1' , my_data$col1)]
my_data$col2[grep('(([A-Z] ?[0-9]){3})|.', '\\1', my_data$col2)]
my_data$col3[grep('(([A-Z] ?[0-9]){3})|.', '\\1', my_data$col3)]
my_data$col4[grep('(([A-Z] ?[0-9]){3})|.', '\\1', my_data$col4)]

Based on the above code, I was thinking on identifying which of the columns meet the condition in each step, and then using the COLASCE statement in dplyr to create the final dataset. But I think this might be a very long way of accomplishing this problem.

Can someone please suggest a faster way to solve this problem?

Thanks!

CodePudding user response:

Try this:

tmp <- apply(my_data[,-1], 1, function(z) { z <- z[!is.na(z) & z != "NA"]; ind <- dplyr::case_when(grepl("^Phone:", z) ~ "phone", grepl("(([A-Z] ?[0-9]){3})", z) ~ "address", TRUE ~ "name"); data.frame(lapply(split(z, ind), paste, collapse = " ")); })
tmp
# [[1]]
#         address      name               phone
# 1 river B8C 9L4 john matt Phone: 111 1111 111
# [[2]]
#            address              name
# 1 Field U9H 5E2 PP henry steve peter
# [[3]]
#              address name               phone
# 1 forest K0Y 1U9 hu2 adam Phone: 333 333 1113
# [[4]]
#            address  name               phone
# 1 ocean A1B 5H1 dd jenna Phone: 444 111 1153
# [[5]]
#         name           phone
# 1 dave kevin Phone: 222 2222

With this, we can combine them (I prefer dplyr::bind_rows or data.table::rbindlist).

dplyr::bind_rows(tmp)
#              address              name               phone
# 1      river B8C 9L4         john matt Phone: 111 1111 111
# 2   Field U9H 5E2 PP henry steve peter                <NA>
# 3 forest K0Y 1U9 hu2              adam Phone: 333 333 1113
# 4   ocean A1B 5H1 dd             jenna Phone: 444 111 1153
# 5               <NA>        dave kevin     Phone: 222 2222
cbind(my_data[,1,drop=FALSE], dplyr::bind_rows(tmp))
#   id            address              name               phone
# 1  1      river B8C 9L4         john matt Phone: 111 1111 111
# 2  2   Field U9H 5E2 PP henry steve peter                <NA>
# 3  3 forest K0Y 1U9 hu2              adam Phone: 333 333 1113
# 4  4   ocean A1B 5H1 dd             jenna Phone: 444 111 1153
# 5  5               <NA>        dave kevin     Phone: 222 2222

CodePudding user response:

One solution based on grep

setNames(data.frame(my_data$id, 
sapply(c("^[a-z] [a-z] $", "^[^P] .*[[:digit:]] ", "Phone"), function(srch) 
  apply(my_data[,-1], 1, function(x) 
    paste(grep(srch, x, value=T), collapse=" ")))),
  c("Id", "Name", "Address", "Phone"))
  Id              Name            Address               Phone
1  1         john matt      river B8C 9L4 Phone: 111 1111 111
2  2 henry steve peter   Field U9H 5E2 PP
3  3              adam forest K0Y 1U9 hu2 Phone: 333 333 1113
4  4             jenna   ocean A1B 5H1 dd Phone: 444 111 1153
5  5        dave kevin                        Phone: 222 2222
  • Related