How to split a string based on a reference list and on words at the same time efficiently in Python?

I have a string and a reference list of elements. I want to split the string into a list of elements, taking the reference list into account: any substring that matches a reference item should stay intact as a single element, and everything else should be split into words. For example,

reference_list = ['10', '2 to 3', '1 and 1/2', '1/2', '1/22', ... ... etc]
my_list = "this happened at 10 o'clock and now after 2 to 3 hours has gone..meet 1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012... etc."

The output should look like this:

out = ["this", "happened", "at", "10", "o'clock", .... "2 to 3", "hours", ... ... "1 and 1/2", "hours", ... "1/22", "or", "2/12/2012", ... ]

I would appreciate any help. Thank you in advance.

Update:

I have tried this,

    reg = r'\b(%s|\w+)\b' % '|'.join(reference_list)
    print(reg)
    result = []
    for e in re.finditer(reg, sentence):
        result.append(e.group())

    print(result)

Doesn't work.

CodePudding user response：

This is similar to the split strings and keep separators problem.

You could concatenate all of your reference_list strings into one regex and use that.

Then for the resulting list, you can split the results that aren't in the reference_list by spaces.
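
A minimal Python sketch of that idea, under some assumptions: the helper name split_keep_references is made up for illustration, the reference items are ordered longest first so that '1 and 1/2' is tried before '1/2', and the non-reference pieces are split on whitespace only, so punctuation stays attached to the neighbouring word.

```python
import re

def split_keep_references(text, reference_list):
    # Try longer reference items first (an assumption about tie-breaking).
    ordered = sorted(reference_list, key=len, reverse=True)
    # The capturing group makes re.split keep the matched items in its output.
    pattern = r"\b(" + "|".join(re.escape(item) for item in ordered) + r")\b"
    out = []
    for i, piece in enumerate(re.split(pattern, text)):
        if i % 2 == 1:
            # Odd positions hold the captured reference items.
            out.append(piece)
        else:
            # Even positions hold the text in between; split it into words.
            out.extend(piece.split())
    return out

words = split_keep_references(
    "meet at 10 or in 1 and 1/2 hours",
    ["10", "1/2", "1 and 1/2"],
)
print(words)
```

With a single capturing group, re.split alternates between ordinary text and captured matches, which is why the index parity is enough to tell the two apart.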

CodePudding user response：

Suppose we have the following data:

reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']

my_list = "this happened at 10 o'clock and now after 2 to 3 " \
          "to 4 hours has gone we've decided to meet on-time " \
          "1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012"

(I have written the string this way so that it can be viewed without the need for horizontal scrolling.)

The key is to first sort reference_list into a list new_list such that if new_list[j] is a substring of new_list[i] then i < j (the converse is generally not true). With Ruby this could be done as follows.

new_list = reference_list.sort { |a,b| a.include?(b) ? -1 : 1 }
  #=> ["1/22", "1 and 1/2", "1/2", "2 to 3 to 4", "10", "1",
  #    "2 to 3", "2"]

I assume Python code would be similar.
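
One way to get the same property in Python is to sort by length, longest first. This is not a translation of the Ruby comparator, just a different technique that satisfies the condition: a string can never be contained in a shorter one, so every item ends up ahead of anything it contains. The function name order_for_alternation is made up for this sketch.

```python
def order_for_alternation(items):
    # Longest first: any item that contains another is strictly longer,
    # so it is placed before the item it contains.
    return sorted(items, key=len, reverse=True)

reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']
new_list = order_for_alternation(reference_list)
print(new_list)

# Check the property: no later item contains an earlier one.
ok = all(new_list[i] not in new_list[j]
         for i in range(len(new_list))
         for j in range(i + 1, len(new_list)))
print(ok)
```

The resulting order differs from the Ruby output above, but only the containment property matters for the regular expression.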

Next we programmatically construct a regular expression from new_list. Again, this could be done as follows in Ruby, and I assume the Python code would be similar:

/\b(?:#{new_list.join('|')}|[\w'-]+)\b/
  #=> /\b(?:1\/22|1 and 1\/2|1\/2|2 to 3 to 4|10|1|2 to 3|2|[\w'-]+)\b/

If this regular expression is used with re.findall we obtain the following result:

["this", "happened", "at", "10", "o'clock", "and", "now", "after",
 "2 to 3 to 4", "hours", "has", "gone", "we've", "decided", "to",
 "meet", "on-time", "1 and 1/2", "hours", "later", "Visit", "us",
 "on", "1/22", "or", "2", "12", "2012"]

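
For reference, here is how the whole thing might look in Python. It is a sketch under two assumptions: the list is ordered longest first instead of with the Ruby comparator, and each item is passed through re.escape in case it contains regex metacharacters.

```python
import re

reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']

my_list = ("this happened at 10 o'clock and now after 2 to 3 "
           "to 4 hours has gone we've decided to meet on-time "
           "1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012")

# Longest items first, so an item is tried before anything it contains.
new_list = sorted(reference_list, key=len, reverse=True)

# Reference items first, then the catch-all for ordinary words.
pattern = r"\b(?:" + "|".join(map(re.escape, new_list)) + r"|[\w'-]+)\b"

# The group is non-capturing, so findall returns the full matches.
result = re.findall(pattern, my_list)
print(result)
```

Note that the date 2/12/2012 still comes out as three separate pieces, because '/' is not part of the catch-all character class.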

Before any match has been made, and after each match has been made, findall attempts to match '1/22' at the current location in the string. If that fails, it attempts to match '1 and 1\/2', and so on. Finally, if every alternative but the last fails, it attempts to match the catch-all [\w'-]+. I have arbitrarily included an apostrophe (so "o'clock" will be matched) and a hyphen (so "on-time" will be matched). Notice that all matches must be preceded and followed by a word boundary (\b).

Notice that while the text '2 to 3 to 4' could be matched by any of 2 to 3 to 4, 2 to 3 and 2, the ordering of the elements of the alternation ensures that the first of these is the match that is made.
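
The effect of the ordering is easy to see in a small Python experiment. The sample text and the two hand-written patterns are made up for illustration:

```python
import re

text = "after 2 to 3 to 4 hours"

# Shortest alternative first: the engine stops at the first one that fits.
short_first = re.findall(r"\b(?:2|2 to 3|2 to 3 to 4)\b", text)

# Longest alternative first: the full phrase is tried before its parts.
long_first = re.findall(r"\b(?:2 to 3 to 4|2 to 3|2)\b", text)

print(short_first)
print(long_first)
```

With the short alternative first only the lone 2 is picked up, since regex alternation takes the first branch that succeeds, not the longest.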