I have a regex that looks like this to extract order numbers from columns:
df["Orders"].str.extract('([0-9]{9,10}[/ #_;.-]?)')
The orders column can look like this:
12
123456789
1234567890
123456789/1234567890
123456789/1/123456789
123456789 1234567890
The resulting new column in the dataframe after the regex should look like this:
NaN
123456789
1234567890
123456789/1234567890
123456789/123456789
123456789 1234567890
However, with my current regex I'm getting the following result:
NaN
123456789
1234567890
123456789/
123456789/
123456789
How can I get the result that I'm looking for?
CodePudding user response:
You can use
import pandas as pd
df = pd.DataFrame({'Orders':['12','123456789','1234567890','123456789/1234567890','123456789/1/123456789','123456789 1234567890', 'Order number: 6508955960_000010_1005500']})
df["Result"] = df["Orders"].str.findall(r'[/ #_;.-]?(?<![0-9])[0-9]{9,10}(?![0-9])').str.join('').str.lstrip('/ #_;.-')
df.loc[df['Result'] == '', 'Result'] = np.nan
See the regex demo. Details
[/ #_;.-]?(?<![0-9])[0-9]{9,10}(?![0-9])
- matches an optional/
,#
,_
,;
,.
or-
char, and then none or ten digit number not enclosed with other digitsSeries.str.findall
extracts all occurrences.str.join('')
concatenates the matches into a single string.str.lstrip('/ #_;.-')
- removes the special chars that were matched with the number at the beginning of the stringdf.loc[df['Result'] == '', 'Result'] = np.nan
- if needed - replaces empty strings withnp.nan
values in theResult
column.
Output:
>>> df
Orders Result
0 NaN NaN
1 123456789 123456789
2 1234567890 1234567890
3 123456789/1234567890 123456789/1234567890
4 123456789/1/123456789 123456789/123456789
5 123456789 1234567890 123456789 1234567890
>>>
CodePudding user response:
You can adapt the next code to work with data frames,
RegExp: (?:^|([/ #_;.-]))(?:\d{1,8})(?!\d)
(?:\d{1,8})(?!\d)
- find a number (<9 digits)([/ #_;.-])
- preceded by one/none of the possible delimiters (group #1)
Conditionally replace with NaN
or empty string - subst
uses match.group(1)
to differ between the two options:
- standalone-invalid -
12
- invalid-with-delimiter -
/1
import re
regex = r"(?:^|([/ #_;.-]))(?:\d{1,8})(?!\d)"
test_str = ("12\n"
"123456789\n"
"1234567890\n"
"123456789/1234567890\n"
"123456789/1/123456789\n"
"123456789 1234567890")
def subst(match):
m = match.group(1)
return "" if m else "NaN"
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
print(result)
Output:
NaN
123456789
1234567890
123456789/1234567890
123456789/123456789
123456789 1234567890