How to match multiple words as a single entry with Regex?

I have a list of items that also includes the type and weight/size of the item. I am trying to extract the item names. I tried several different approaches, but the closest I got was extracting every word as a single entry.

The regex pattern I used:

pattern_2=re.compile(r'[a-zA-Z]+\s')

I get this result:

list=['Milk ','Loaf ','of ','Fresh ','White ','Bread ','Rice ']

The result that I want is this:

list=['Milk','Loaf of Fresh White Bread']

I tried the pattern proposed in "Regular expression matching a multiline block of text", but it matches the entire list as a block.

Portion of my list:

list=['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)', 'Rice (white) (1 lb)', 'Eggs (regular) (12)', 'Local Cheese (1 lb)']

The list itself is longer, so I am trying to find a pattern that can be used for the entire list. Is it possible to write a regex pattern that matches the list items as a whole?

CodePudding user response：

You can use

import re
l=['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)', 'Rice (white) (1 lb)', 'Eggs (regular) (12)', 'Local Cheese (1 lb)']
for s in l:
    m = re.search(r'^[a-z]+(?:\s+[a-z]+)*', s, re.I)
    if m:
        print(m.group())

Or, if you use Python 3.8+:

import re
l=['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)', 'Rice (white) (1 lb)', 'Eggs (regular) (12)', 'Local Cheese (1 lb)']
print([m.group() for s in l if (m := re.search(r'^[a-z]+(?:\s+[a-z]+)*', s, re.I))])

Output:

Milk
Loaf of Fresh White Bread
Rice
Eggs
Local Cheese


The ^[a-z]+(?:\s+[a-z]+)* regex matches one or more letters at the start of a string, followed by zero or more occurrences of one or more whitespace characters and one or more letters. Matching is case insensitive due to the re.I option.
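To make the parts of that pattern easier to see, here is a minimal sketch that writes it out with re.VERBOSE, with a `+` quantifier after each character class and after `\s`. The names NAME_PATTERN and extract_name are hypothetical and only used for illustration, and the third test item is made up to show the no-match case.

```python
import re

# The same pattern, spread over several lines with re.VERBOSE so that
# each part can carry a comment. Whitespace inside the pattern is ignored.
NAME_PATTERN = re.compile(
    r"""
    ^             # anchor at the start of the string
    [a-z]+        # first word: one or more letters
    (?:           # non-capturing group for each following word
        \s+       #   one or more whitespace characters
        [a-z]+    #   then one or more letters
    )*            # the group may repeat zero or more times
    """,
    re.I | re.VERBOSE,
)


def extract_name(item):
    """Return the leading run of words, or None if the item does not start with a letter."""
    m = NAME_PATTERN.search(item)
    return m.group() if m else None


items = ['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)', '(12) Eggs']
names = [extract_name(s) for s in items]
print(names)  # ['Milk', 'Loaf of Fresh White Bread', None]
```

Because the group requires letters after the whitespace, the space before the first parenthesis is never part of the match, so no stripping is needed afterwards.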

CodePudding user response：

I managed up to here but I still have spaces to remove at the beginning/end of the elements:

import re

pattern_2=re.compile(r'([a-zA-Z\s]+\s)')

lst = ['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)', 'Rice (white) (1 lb)', 'Eggs (regular) (12)', 'Local Cheese (1 lb)']
string = "Milk (regular) (1 gallon), Loaf of Fresh White Bread (1 lb), Rice (white) (1 lb), Eggs (regular) (12), Local Cheese (1 lb)"

# for a string
result_string = pattern_2.findall(string)
print(result_string)
# for a list
result_lst = pattern_2.findall(', '.join(lst))
print(result_lst)

''' OUTPUT
['Milk ', ' Loaf of Fresh White Bread ', ' Rice ', ' Eggs ', ' Local Cheese ']
['Milk ', ' Loaf of Fresh White Bread ', ' Rice ', ' Eggs ', ' Local Cheese ']
'''
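One possible way to get rid of the leftover spaces is to call str.strip on each match after findall has run. This is only a sketch built on the pattern from this answer (written with `+` quantifiers); the name cleaned is an assumption introduced for the example.

```python
import re

pattern_2 = re.compile(r'([a-zA-Z\s]+\s)')

lst = ['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)',
       'Rice (white) (1 lb)', 'Eggs (regular) (12)', 'Local Cheese (1 lb)']

# findall returns matches that still carry leading/trailing spaces
raw = pattern_2.findall(', '.join(lst))

# strip() removes whitespace on both sides of each match
cleaned = [name.strip() for name in raw]
print(cleaned)
# ['Milk', 'Loaf of Fresh White Bread', 'Rice', 'Eggs', 'Local Cheese']
```

Note that this approach relies on the text inside the parentheses never containing letters followed by a space, so it may need adjusting for other data.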

CodePudding user response：

import re

s = re.findall(r'[^()]+', 'Loaf of Fresh White Bread (1 lb)')[0].rstrip()

To apply this to the whole list, use the following code (given_list -> result_list):

import re

given_list = ['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)', 'Rice (white) (1 lb)', 'Eggs (regular) (12)', 'Local Cheese (1 lb)']
result_list = [re.findall(r'[^()]+', x)[0].rstrip() for x in given_list]
print(result_list)
# prints ['Milk', 'Loaf of Fresh White Bread', 'Rice', 'Eggs', 'Local Cheese']

Regex can be tricky to use.

I recommend reading up on regular expressions and the automata theory behind them to become familiar with this tool.

Explanation of the code:

r'[^()]+' can be dissected into [], ^, () and +.

'[]' defines a set of characters: whatever is listed inside the brackets belongs to the set.

'+' means repetition of at least 1 time.

'[]+' means that characters from the set are repeated 1 or more times.

'^' at the start of a set means the complement of that set.

In simple terms it means "the set of everything except something".

The "something" here is '(' and ')'.

So the set "everything but parentheses" is made, and characters from that set are repeated 1 or more times.

So in human language this means

"a string of any character except '(' or ')', of length 1 or more."

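The findall method finds all non-overlapping substrings that satisfy this condition and returns them as a list.

[0] takes the first element of that list, which is the text before the first parenthesis.

rstrip removes the trailing whitespace, because this pattern also matches the space before the first parenthesis.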
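To see why the [0] and rstrip steps are needed, it can help to print what findall returns before they are applied. The sketch below does that for one item; first_segment is a hypothetical helper name, and the guard for an empty result is an addition that is not part of the original answer.

```python
import re

item = 'Milk (regular) (1 gallon)'

# Every run of non-parenthesis characters becomes one list element
segments = re.findall(r'[^()]+', item)
print(segments)  # ['Milk ', 'regular', ' ', '1 gallon']

# The first element is the item name plus the space before '('
name = segments[0].rstrip()
print(name)  # Milk


def first_segment(text):
    """Return the first run of non-parenthesis characters, or None if there is none."""
    found = re.findall(r'[^()]+', text)
    return found[0].rstrip() if found else None


# An item that starts with a parenthesis returns the parenthesized text instead
print(first_segment('(12) Eggs'))  # 12
# An empty string has no segments, so indexing with [0] directly would raise IndexError
print(first_segment(''))  # None
```

So this approach assumes that every item starts with its name and that the name itself contains no parentheses.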