I receive data from multiple parties and want to convert it into structured data so that everything in our system follows one format.
For example, I receive data like this:
- nominal diameter 1-13 x 0.5 mm
- nominal diameter 10 mm
- for external diameter 15mm
- head dm 9.00 mm
- diameter 208/20 mm height 218 mm
The goal is to get this output, in the same order:
- M1-13x0.5
- M10
- M15
- M9
- M208/20 H218
I thought I would do it with multiple regexes like these and then replace the text:
df['diameter'] = df['New_size'].str.findall(r'^nominal diameter\s([\S]+)\sx\s([\S]+)\smm$')
df['diam2'] = 'Nom.M' + df['diameter'].str[0].str[0] + 'x' + df['diameter'].str[0].str[1]
df['diameter'] = df['New_size'].str.findall(r'^nominal diameter\s([\S]+)\smm$')
df['diam2'] = 'Nom.M' + df['diameter'].str[0]
But this only helps when searching for the diameter. In example 5 I would need to run it again and somehow join the results to add the height as well.
Is there a way to loop through the data and replace regex by regex, like in Excel? Or is my plan completely stupid and there are much better ways to do this?
I also thought of splitting everything into pieces, replacing word after word, and then joining only the parts I think are necessary, but that really seems too complex. Thanks!
CodePudding user response:
With this type of editing you can't get it completely right the first time, especially if the df is big and has diverse text. Go in stages.
Option 1: Replace patterns rather than target substrings
Replace values in the df. First replace any run of letters (except x) that is followed by a space and a digit with H. Then replace all non-digits from the start of the string up to the first numeral with M. Then remove mm and any dot followed by two zeros.
df['text_edited'] = df['text'].replace(regex={r'[a-wy-z]+\s(?=\d)': 'H', r'^[\D]+(?=\d)': 'M', 'mm': '', r'\.00': ''})
   text                              text_edited
0  nominal diameter 1-13 x 0.5 mm    M1-13 x 0.5
1  nominal diameter 10 mm            M10
2  for external diameter 15mm        M15
3  head dm 9.00 mm                   M9
4  diameter 208/20 mm height 218 mm  M208/20 H218
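The staged idea can also be written as an explicit chain of Series.str.replace calls, which makes the order of the passes visible. This is only a minimal sketch: the column name text, the sample rows, and the exact patterns are assumptions taken from the question's five examples, and real data will probably need more stages.

```python
import pandas as pd

# Sample rows copied from the question; the column name 'text' is an assumption.
df = pd.DataFrame({'text': [
    'nominal diameter 1-13 x 0.5 mm',
    'nominal diameter 10 mm',
    'for external diameter 15mm',
    'head dm 9.00 mm',
    'diameter 208/20 mm height 218 mm',
]})

df['text_edited'] = (
    df['text']
    .str.replace(r'\s*mm\b', '', regex=True)           # stage 1: drop the unit
    .str.replace(r'height\s+(?=\d)', 'H', regex=True)  # stage 2: height -> H
    .str.replace(r'^\D+(?=\d)', 'M', regex=True)       # stage 3: leading words -> M
    .str.replace(r'\.0+\b', '', regex=True)            # stage 4: 9.00 -> 9
    .str.replace(r'\s*x\s*', 'x', regex=True)          # stage 5: ' x ' -> 'x'
    .str.strip()
)
print(df)
```

Each stage only has to handle one concern, so when new text variants show up you can inspect the intermediate result and add or adjust a single pass.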
Option 2: Replace target substrings
I assume your main aim is to replace diameter and height with their initials. Do that, and if any unwanted residue significantly changes the meaning of the attribute, edit the new column further.
df['text_edited'] = df['text'].replace(regex={r'diameter\s|dm\s': 'M', r'height\s': 'H', r'[^MHx0-9\W]': ' ', r'\.00': ''})
df['text_edited'] = df['text_edited'].str.strip().str.replace(r'^[x]', '', regex=True)
   text                              text_edited
0  nominal diameter 1-13 x 0.5 mm    M1-13 x 0.5
1  nominal diameter 10 mm            M10
2  for external diameter 15mm        M15
3  head dm 9.00 mm                   M9
4  diameter 208/20 mm height 218 mm  M208/20 H218
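If the main aim is to pick out the diameter and the height, another way to read Option 2 is to extract each attribute into its own Series and join them afterwards. The sketch below assumes the column is called text and that the keywords are always diameter, dm, or height; the patterns are illustrative and only checked against the five sample rows.

```python
import pandas as pd

# Sample rows copied from the question; the column name 'text' is an assumption.
df = pd.DataFrame({'text': [
    'nominal diameter 1-13 x 0.5 mm',
    'nominal diameter 10 mm',
    'for external diameter 15mm',
    'head dm 9.00 mm',
    'diameter 208/20 mm height 218 mm',
]})

# Capture the value after the keyword, including an optional ' x <pitch>' part.
diameter = df['text'].str.extract(
    r'(?:diameter|dm)\s*([\d./-]+(?:\s*x\s*[\d.]+)?)', expand=False)
# Height is optional, so rows without it become NaN.
height = df['text'].str.extract(r'height\s*([\d.]+)', expand=False)

# Tidy the diameter: remove inner spaces and a trailing '.00'.
diameter = (diameter.str.replace(r'\s+', '', regex=True)
                    .str.replace(r'\.0+$', '', regex=True))

# Join the parts; a missing height contributes an empty string.
df['text_edited'] = 'M' + diameter + (' H' + height).fillna('')
print(df)
```

Keeping diameter and height as separate Series until the last line also leaves them available as structured columns if the system needs them individually.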
CodePudding user response:
I'd just use a sequence of regular expressions:
df['new_col'] = (
    df['col']
    .str.replace(r'.*nominal diameter ([\d-]+) x ([\d.]+) mm.*', r'M\1x\2', regex=True)
    .str.replace(r'.*nominal diameter ([\d.]+) mm.*', r'M\1', regex=True)
    .str.replace(r'.*for external diameter ([\d.]+)mm.*', r'M\1', regex=True)
    .str.replace(r'.*head dm ([\d.]+) mm.*', r'M\1', regex=True)
    .str.replace(r'.*diameter ([\d./]+) mm height ([\d.]+) mm.*', r'M\1 H\2', regex=True)
)
Output (before):
>>> import pandas as pd
>>> df = pd.DataFrame({'col': ['nominal diameter 1-13 x 0.5 mm', 'nominal diameter 10 mm', 'for external diameter 15mm', 'head dm 9.00 mm', 'diameter 208/20 mm height 218 mm']})
>>> df
   col
0  nominal diameter 1-13 x 0.5 mm
1  nominal diameter 10 mm
2  for external diameter 15mm
3  head dm 9.00 mm
4  diameter 208/20 mm height 218 mm
Output (after):
>>> df
   col                               new_col
0  nominal diameter 1-13 x 0.5 mm    M1-13x0.5
1  nominal diameter 10 mm            M10
2  for external diameter 15mm        M15
3  head dm 9.00 mm                   M9.00
4  diameter 208/20 mm height 218 mm  M208/20 H218
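To loop through the data "regex for regex", as the question puts it, the same sequence can be kept as a list of (pattern, template) rules and applied with a small function that stops at the first rule that matches. This is a sketch under assumptions: RULES and normalize are hypothetical names, the three patterns only cover the five sample rows, and the final cleanup that turns 9.00 into 9 is an addition to match the requested output.

```python
import re
import pandas as pd

# Hypothetical rule table: most specific pattern first, first match wins.
RULES = [
    (r'diameter ([\d./-]+) mm height ([\d.]+) mm', r'M\1 H\2'),
    (r'diameter ([\d-]+) x ([\d.]+) mm', r'M\1x\2'),
    (r'(?:diameter|dm) ?([\d.]+) ?mm', r'M\1'),
]
COMPILED = [(re.compile(pattern), template) for pattern, template in RULES]

def normalize(text):
    """Return the code for the first matching rule, or the text unchanged."""
    for pattern, template in COMPILED:
        match = pattern.search(text)
        if match:
            # Fill the template with the captured groups, then drop '.00'.
            return re.sub(r'\.0+\b', '', match.expand(template))
    return text  # unrecognised input is left as-is for later review

df = pd.DataFrame({'col': [
    'nominal diameter 1-13 x 0.5 mm',
    'nominal diameter 10 mm',
    'for external diameter 15mm',
    'head dm 9.00 mm',
    'diameter 208/20 mm height 218 mm',
]})
df['new_col'] = df['col'].map(normalize)
print(df)
```

Because rows that match no rule come back unchanged, comparing col with new_col afterwards shows which inputs still need a new rule.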