R: Modifying a REGEX Expression

Time:11-19

I have the following dataset:

id = 1:5
col1 = c("john", "henry", "adam", "jenna", "peter")
col2 = c("river B8C 9L4", "Field U9H 5E2 PP", "NA", "ocean A1B 5H1 dd", "dave")
col3 = c("matt", "steve", "forest K0Y 1U9 hu2", "NA", "NA")
col4 = c("Phone: 111 1111 111", "Phone: 222 2222", "Phone: 333 333 1113", "Phone: 444 111 1153", "Phone: 111 111 1121")
my_data = data.frame(id, col1, col2, col3, col4)

id  col1             col2               col3                col4
1  1  john    river B8C 9L4               matt Phone: 111 1111 111
2  2 henry Field U9H 5E2 PP              steve     Phone: 222 2222
3  3  adam               NA forest K0Y 1U9 hu2 Phone: 333 333 1113
4  4 jenna ocean A1B 5H1 dd                 NA Phone: 444 111 1153
5  5 peter             dave                 NA Phone: 111 111 1121

I found this regex code, which extracts a pattern of three letter-digit pairs (e.g. B8C 9L4) from each row; it could then be wrapped into a function:

 apply(my_data, 1, function(x) gsub('(([A-Z] ?[0-9]){3})|.', '\\1', toString(x)))

[1] "B8C 9L4" "U9H 5E2" "K0Y 1U9" "A1B 5H1" ""   

Once this has been done, is there any way to extend this code such that

  • Once the row/column with the REGEX condition has been identified, the entire contents of this row/column are extracted?

For example, the result would then look like this:

[1] "river B8C 9L4"  "Field U9H 5E2 PP"  "forest K0Y 1U9 hu2"  "ocean A1B 5H1 dd"

CodePudding user response:

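One option is to loop over the rows, drop the elements that are "NA" or that contain the substring "Phone", and then keep the first remaining element with at least three words (that is, where str_count of "\\w " is greater than 1). Rows with no such element return NA and are removed by na.omit.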

library(stringr)
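na.omit(apply(my_data[-1], 1, \(x) {
    x <- x[x != "NA"]                   # drop the literal "NA" strings
    x1 <- x[!str_detect(x, "Phone")]    # drop the phone-number entries
    x1[str_count(x1, "\\w ") > 1][1]    # first entry with at least three words, else NA
}))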

-output

[1] "river B8C 9L4"      "Field U9H 5E2 PP"   
[3] "forest K0Y 1U9 hu2" "ocean A1B 5H1 dd"  
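
If the goal is to reuse the regex from the question rather than count words, a base R variant might look like the sketch below. It assumes that the pattern ([A-Z] ?[0-9]){3} is the right test for a cell and that only the first matching cell per row is wanted. The helper name first_matching_cell is made up for illustration and is not part of the original answer.

```r
# Rebuild the example data from the question
my_data <- data.frame(
  id   = 1:5,
  col1 = c("john", "henry", "adam", "jenna", "peter"),
  col2 = c("river B8C 9L4", "Field U9H 5E2 PP", "NA", "ocean A1B 5H1 dd", "dave"),
  col3 = c("matt", "steve", "forest K0Y 1U9 hu2", "NA", "NA"),
  col4 = c("Phone: 111 1111 111", "Phone: 222 2222", "Phone: 333 333 1113",
           "Phone: 444 111 1153", "Phone: 111 111 1121")
)

# Same pattern as in the question: an uppercase letter, an optional space,
# and a digit, repeated three times (e.g. "B8C 9L4")
pattern <- "([A-Z] ?[0-9]){3}"

# Hypothetical helper: for each row, return the whole cell that contains the
# pattern (the first one if several match); rows without a match are dropped
first_matching_cell <- function(df, pattern) {
  out <- apply(as.matrix(df), 1, function(row) {
    hit <- row[grepl(pattern, row)]
    if (length(hit) > 0) hit[[1]] else NA_character_
  })
  unname(out[!is.na(out)])
}

result <- first_matching_cell(my_data[-1], pattern)
print(result)
```

Because this keys on the regex itself, it does not depend on how many words a cell has or on filtering out the "Phone" entries. It would, however, also return any other cell that happens to contain three letter-digit pairs.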