R how to speed up pattern matching using vectors

Time:12-06

I have a column in one dataframe with city and state names in it:

ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")

ac <- as.data.frame(ac)

I would like to search for the values of ac$ac in a column of another data frame, df$description, and return the value of the id column wherever there is a match.

dput(df)
structure(list(month = c(202110L, 201910L, 202005L, 201703L, 
201208L, 201502L), id = c(100559687L, 100558763L, 100558934L, 
100558946L, 100543422L, 100547618L), description = c("residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95", 
"digital video programming service multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission", 
"residential all distance telephone service  unlimited  voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission", 
"residential all distance telephone service  unlimited voice only pittsburgh pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking", 
"local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125", 
"residential public switched toll interstate manhattan ks ks plan area residence switched toll base period average revenue per minute 0 18 minute online"
)), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")

I have tried to get the row indexes of the matches with the following methods:

  1. which(ac$ac %in% df$description) -- this returns integer(0).
  2. grep(ac$ac, df$description, value = FALSE) -- this returns the first index, 1, but it isn't vectorized over the patterns.
  3. str_detect(string = ac$ac, pattern = df$description) -- this returns all FALSE, which is incorrect.
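
For what it's worth, the attempts seem to fail for different reasons, and a small base R sketch can illustrate the main one. %in% compares whole strings for equality, so a city name buried inside a longer description never matches; grepl() tests for a substring instead, but takes a single pattern at a time, which is why attempt 2 only reports on the first city. In attempt 3 the arguments look swapped, since str_detect() searches for pattern inside string. The shortened strings and the names cities and descriptions below are made up for the illustration.

```r
# Shortened, made-up stand-ins for ac$ac and df$description
cities <- c("san francisco ca", "pittsburgh pa", "gainesville fl")
descriptions <- c("local with more san francisco ca flat rate",
                  "voice only harrisburg pa flat rate",
                  "voice only pittsburgh pa flat rate")

# %in% asks: is this whole string equal to one of those strings?
exact <- cities %in% descriptions
print(exact)

# grepl() asks: does this one pattern occur inside each string?
# fixed = TRUE treats the pattern as literal text rather than a regex.
first_city_hits <- grepl(cities[1], descriptions, fixed = TRUE)
print(first_city_hits)
```

None of the cities equals a full description, so exact is all FALSE, while the substring test finds the first city in the first description only.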

My question: how do I search for ac$ac in df$description and return the corresponding value of df$id in the event of a match? Note that the vectors are not of the same length. I am looking for ALL matches, not just the first. I would prefer something simple and fast, because my actual dataset has over 100k rows but any suggestions or ideas are welcome. Thanks.
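
One possible approach for a larger dataset, sketched here under assumptions rather than benchmarked: collapse all city names into a single alternation pattern so that each description is scanned in one vectorized grepl() call. The sketch assumes the city strings contain no regex metacharacters, uses shortened made-up descriptions, and only reports the first city found in each row. With a very long list of cities the combined pattern may itself become slow, so it would need timing on the real data.

```r
# Shortened, made-up stand-ins for the question's ac and df
ac <- data.frame(ac = c("san francisco ca", "pittsburgh pa",
                        "manhattan ks", "gainesville fl"),
                 stringsAsFactors = FALSE)
df <- data.frame(
  id = c(100559687L, 100558763L, 100558946L, 100547618L),
  description = c("local with more san francisco ca flat rate",
                  "multilatino ultra bensalem pa service",
                  "voice only pittsburgh pa flat rate",
                  "interstate manhattan ks ks plan area"),
  stringsAsFactors = FALSE
)

# One alternation pattern: "san francisco ca|pittsburgh pa|..."
# Assumption: the city strings contain no regex metacharacters.
pattern <- paste(ac$ac, collapse = "|")

# Single vectorized pass over all descriptions
hit <- grepl(pattern, df$description)
matched_ids <- df$id[hit]

# First city matched in each hit row (non-matching rows are dropped)
matched_city <- regmatches(df$description, regexpr(pattern, df$description))
print(data.frame(id = matched_ids, city = matched_city))
```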

CodePudding user response:

Try this, using sapply with grep:

d$id[ unlist( sapply( c$c, function(x) grep(x, d$description ) ) ) ]
[1] 100559687 100558946 100547618
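
A variant of the same idea, written against the names used in the question (ac and df) and offered only as a sketch: it keeps the ids grouped by city, so you can see which city produced which match, and passes fixed = TRUE so each city is matched as literal text rather than as a regular expression. The descriptions are shortened, made-up stand-ins for the real column.

```r
# Shortened, made-up stand-ins for the question's ac and df
ac <- data.frame(ac = c("san francisco ca", "pittsburgh pa",
                        "gainesville fl", "manhattan ks"),
                 stringsAsFactors = FALSE)
df <- data.frame(
  id = c(100559687L, 100558763L, 100558946L, 100547618L),
  description = c("local with more san francisco ca flat rate",
                  "multilatino ultra bensalem pa service",
                  "voice only pittsburgh pa flat rate",
                  "interstate manhattan ks ks plan area"),
  stringsAsFactors = FALSE
)

# For each city, collect the ids of every description that contains it
ids_by_city <- lapply(ac$ac, function(city) {
  df$id[grepl(city, df$description, fixed = TRUE)]
})
names(ids_by_city) <- ac$ac

# Flatten to a plain vector of matching ids
all_ids <- unlist(ids_by_city, use.names = FALSE)
print(all_ids)
```

Cities with no match, such as gainesville fl here, end up as empty entries in ids_by_city and simply vanish when the list is flattened.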

CodePudding user response:

First, there is no c$c assignment in the provided code. All the data is assigned to a variable called c, and that variable has no member named c (the c$c you are trying to work with).

Second, it is very bad practice to give a variable the same name as a basic R function, as in c <- c(...).
