Replacement with regex or anything else in pandas dataframe

I want to manage data that I receive from multiple parties and convert it into structured data so that it is consistent across our system.

For example, I receive data like this:

nominal diameter 1-13 x 0.5 mm
nominal diameter 10 mm
for external diameter 15mm
head dm 9.00 mm
diameter 208/20 mm height 218 mm

The goal is to get this output, in the same order:

M1-13x0.5
M10
M15
M9
M208/20 H218

I thought I would do it with multiple regexes like these and then replace the matches:

df['diameter'] = df['New_size'].str.findall(r'^nominal diameter\s([\S]+)\sx\s([\S]+)\smm$')
df['diam2'] = 'Nom.M' + df['diameter'].str[0].str[0] + 'x' + df['diameter'].str[0].str[1]

df['diameter'] = df['New_size'].str.findall(r'^nominal diameter\s([\S]+)\smm$')
df['diam2'] = 'Nom.M' + df['diameter'].str[0]

But this only helps when searching for the diameter. In example 5 I would need to run it again and somehow join the results to also add the height…

Is there a way to loop through the data and replace regex by regex, like in Excel? Or is my plan completely stupid and there are much better ways to do this?

I thought of splitting everything into pieces, replacing word by word, and then joining only the parts I think are necessary, but it seems I am making this too complex. Thanks!

CodePudding user response：

With this type of editing, you can't expect to get it exactly right the first time, especially if the df is big and has diverse text. Work in stages.

Option 1: Replace patterns rather than target substrings

Replace values in the df. First, replace any run of letters (other than x) that is followed by a space and then a digit with H. Then replace the run of non-digits at the start of the string with M. Finally, remove mm and any .00.

df['text_edited'] = df['text'].replace(regex={r'[a-wy-z]+\s(?=\d)': 'H', r'^[\D]+(?=\d)': 'M', 'mm': '', r'\.00': ''})

     

                   text                text_edited
0    nominal diameter 1-13 x 0.5 mm    M1-13 x 0.5 
1            nominal diameter 10 mm            M10 
2        for external diameter 15mm             M15
3                   head dm 9.00 mm             M9 
4  diameter 208/20 mm height 218 mm  M208/20  H218

Option 2: Replace target substrings

I assume your main aim is to replace diameter and height with initials. Do that, and if any unwanted residuals significantly change the meaning of the attribute, edit the new column further.

df['text_edited'] = df['text'].replace(regex={r'diameter\s|dm\s': 'M', r'height\s': 'H', r'[^MHx0-9\W]': ' ', r'\.00': ''})

df['text_edited'] = df['text_edited'].str.strip().str.replace(r'^x\s*', '', regex=True)



                    text                 text_edited
0    nominal diameter 1-13 x 0.5 mm      M1-13 x 0.5
1            nominal diameter 10 mm              M10
2        for external diameter 15mm              M15
3                   head dm 9.00 mm               M9
4  diameter 208/20 mm height 218 mm  M208/20    H218
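
Since the work is staged, it can help to flag which rows still need another pass. The following is only a sketch: the TARGET pattern is my assumption about what a finished value looks like (M plus a diameter, optionally "x" plus a pitch, optionally H plus a height), and the sample values are illustrative, so adjust both to your data.

```python
import pandas as pd

# Assumed target shape: "M<diameter>", optional " x <number>", optional " H<height>".
# This pattern is a guess at the desired format, not part of the original answer.
TARGET = r'M[\d./-]+(?:\s*x\s*[\d.]+)?(?:\s+H[\d.]+)?'

# Illustrative edited values; the last one still has a leftover "x" from "external".
edited = pd.Series([
    'M1-13 x 0.5',
    'M10',
    'M208/20    H218',
    'x        M15',
])

# True where the whole string does NOT match the target shape.
needs_review = ~edited.str.fullmatch(TARGET)

# Show only the rows that need another cleanup stage.
print(edited[needs_review])
```

Anything that shows up here is a candidate for one more replacement rule in the next stage.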

CodePudding user response：

I'd just use a sequence of regular expressions:

df['new_col'] = (
    df['col']
    .str.replace(r'.*nominal diameter ([\d-]+) x ([\d.]+) mm.*', r'M\1x\2', regex=True)
    .str.replace(r'.*nominal diameter ([\d.]+) mm.*', r'M\1', regex=True)
    .str.replace(r'.*for external diameter ([\d.]+)mm.*', r'M\1', regex=True)
    .str.replace(r'.*head dm ([\d.]+) mm.*', r'M\1', regex=True)
    .str.replace(r'.*diameter ([\d./]+) mm height ([\d.]+) mm.*', r'M\1 H\2', regex=True)
)

Output (before):

>>> df = pd.DataFrame({'col':['nominal diameter 1-13 x 0.5 mm','nominal diameter 10 mm','for external diameter 15mm','head dm 9.00 mm','diameter 208/20 mm height 218 mm',]})
>>> df
                                col
0    nominal diameter 1-13 x 0.5 mm
1            nominal diameter 10 mm
2        for external diameter 15mm
3                   head dm 9.00 mm
4  diameter 208/20 mm height 218 mm

Output (after):

>>> df
                                col       new_col
0    nominal diameter 1-13 x 0.5 mm     M1-13x0.5
1            nominal diameter 10 mm           M10
2        for external diameter 15mm           M15
3                   head dm 9.00 mm         M9.00
4  diameter 208/20 mm height 218 mm  M208/20 H218
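
If the list of patterns keeps growing, the same idea can be written as a loop over ordered (pattern, replacement) pairs, which is close to the "regex for regex" replacement asked about in the question. This is a minimal sketch, not the only way to do it: RULES and apply_rules are names I made up, the patterns only cover the five sample rows, and the last rule (dropping a trailing ".00" so that M9.00 becomes M9) is my addition to match the requested output.

```python
import pandas as pd

# Ordered rules, applied one after another. Hypothetical names and patterns,
# written against the five sample rows only.
RULES = [
    (r'.*diameter ([\d./]+) mm height ([\d.]+) mm.*', r'M\1 H\2'),
    (r'.*nominal diameter ([\d-]+) x ([\d.]+) mm.*', r'M\1x\2'),
    (r'.*(?:diameter|dm) ?([\d.]+) ?mm.*', r'M\1'),
    (r'\.0+(?=\s|$)', ''),  # "M9.00" -> "M9"
]

def apply_rules(s, rules=RULES):
    """Apply each regex replacement to the Series in order."""
    for pattern, repl in rules:
        s = s.str.replace(pattern, repl, regex=True)
    return s

df = pd.DataFrame({'col': [
    'nominal diameter 1-13 x 0.5 mm',
    'nominal diameter 10 mm',
    'for external diameter 15mm',
    'head dm 9.00 mm',
    'diameter 208/20 mm height 218 mm',
]})
df['new_col'] = apply_rules(df['col'])
print(df)
```

Keeping the rules in a list makes the order explicit and lets you add or reorder patterns without touching the replacement code.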