I have a set of strings, for example:
- Suitcase 6l
- Backpack (28kg)
- Duffel Bag 6kg
- Purse [3kg]
- Duffel Bag [25l]
- Duffel Bag 10l
I want to extract only the bag type, i.e. the text before the number, the space, and any special character such as [ or (, like:
- Suitcase
- Backpack
- Duffel Bag
- Purse
I tried matching the non-digit characters case-insensitively with the pattern below, but I don't know how to exclude the special characters and the space.
(?i)(\D*^)
Can someone help me do this with a regular expression?
CodePudding user response:
You could match the different listed formats, l and kg, with or without the () and [], and capture the type of bag in a group.
For a case-insensitive match, you can prepend the regex with (?i) or, in Python, use the re.I flag.
^([A-Z].*?)\s+(?:\[\d+(?:l|kg)]|\(\d+(?:l|kg)\)|\d+(?:l|kg)\b)
- ^ Start of string
- ([A-Z].*?) Capture group 1: start the match with a char A-Z and then match as few chars as possible
- \s+ Match 1+ whitespace chars
- (?: Non-capturing group for the alternatives
  - \[\d+(?:l|kg)] Match 1+ digits and either l or kg between [...]
  - | Or
  - \(\d+(?:l|kg)\) The same between (...)
  - | Or
  - \d+(?:l|kg)\b Match 1+ digits and either l or kg, followed by a word boundary
- ) Close the non-capturing group
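A minimal Python sketch of this approach, assuming the whitespace and digit quantifiers are + (one or more) and using the re.I flag for case-insensitive matching. The names PATTERN, items, and bag_type are only illustrative and not part of the answer.

```python
import re

# Capture the bag type in group 1, followed by whitespace and a size
# such as 6l, (28kg) or [25l]. re.I makes the match case-insensitive.
PATTERN = re.compile(
    r"^([A-Z].*?)\s+(?:\[\d+(?:l|kg)]|\(\d+(?:l|kg)\)|\d+(?:l|kg)\b)",
    re.I,
)

items = [
    "Suitcase 6l",
    "Backpack (28kg)",
    "Duffel Bag 6kg",
    "Purse [3kg]",
    "Duffel Bag [25l]",
    "Duffel Bag 10l",
]


def bag_type(text):
    """Return the bag type, or None if the string has no recognised size suffix."""
    m = PATTERN.match(text)
    return m.group(1) if m else None


types = [bag_type(s) for s in items]
print(types)
# ['Suitcase', 'Backpack', 'Duffel Bag', 'Purse', 'Duffel Bag', 'Duffel Bag']
```

Strings without a size in one of the listed formats return None here, so they are easy to spot rather than being silently truncated.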
CodePudding user response:
This regex will get you pretty close, with just the possibility of some extra spaces captured, which you could get rid of with trim():
\b[a-zA-Z ]+\b
This basically says to find the largest group of letters and spaces that don't contain any numbers or special characters.
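As a rough Python sketch of this idea, assuming the character class is followed by + and taking only the first match in each string. Python strings use strip() where other languages have trim(); the names WORDS and bag_type_loose are only illustrative.

```python
import re

# First run of letters and spaces, delimited by word boundaries.
WORDS = re.compile(r"\b[a-zA-Z ]+\b")


def bag_type_loose(text):
    """Return the first run of letters/spaces, with surrounding spaces removed."""
    m = WORDS.search(text)
    return m.group(0).strip() if m else None


samples = ["Suitcase 6l", "Backpack (28kg)", "Duffel Bag 6kg", "Purse [3kg]"]
loose_types = [bag_type_loose(s) for s in samples]
print(loose_types)
# ['Suitcase', 'Backpack', 'Duffel Bag', 'Purse']
```

Unlike a pattern that spells out the l and kg formats, this one does not check what follows the name, so it also returns a result for strings that have no size at all.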
CodePudding user response:
I believe this should be what you're looking for:
[[:alpha:]]+(\s[[:alpha:]]+)?(?!\S*\n)
- [[:alpha:]]+ matches any group of letters
- (\s[[:alpha:]]+)? optionally matches a whitespace character and a group of letters
- (?!\S*\n) is a negative lookahead: if, looking forward, there is an optional group of non-whitespace characters followed by a newline, the match is discarded