Update pandas dataframe column with substring from another dataframe-CodePudding

I have a dataframe where some of the informations are on the wrong field and I need to change it to the wright column. The problem is that those informations are user inserted, so they are in the middle of strings and in differents ways. The df below is a small example of the problem:

|         String          |      Info A      |      Info B      |
|-------------------------|------------------|------------------|
|        'some text'      |         50       |         60       |
|   'A=70, B=80, text'    |                  |                  |
|  'text, A = 10m, B:20'  |                  |                  |

The actual version of the df has 10 variables that I need to change and about 2mi rows.

What I need to do is put those informations on the right field, as shown on the first row of the example.

I tried some things but they all had a errors or would take to much time. If someone could help me think of a solution, I would really appreciate.

CodePudding user response：

You can use a regex with str.extractall to get the variable names and values, then pivot and update:

variables = ['A', 'B']
regex = fr'\b({"|".join(variables)})\s*[=:]\s*(\d )'
# '\\b(A|B)\\s*[=:]\\s*(\\d )'

df.update(df['String']
 .str.extractall(regex)
 .reset_index(level=0)
 .pivot(index='level_0', columns=0, values=1)
 .add_prefix('Info ')
)

output:

                String Info A Info B
0            some text   50.0   60.0
1     A=70, B=80, text     70     80
2  text, A = 10m, B:20     10     20

CodePudding user response：

Here is a simple code that can help you. For your example, you can do :

import pandas as pd

df = pd.DataFrame(columns=["String","Info A","Info B"])
df["String"]=['some text','A=70, B=80, text', 'text, A = 10m, B:20']
df["Info A"]=[50,None,None]
df["Info B"]=[60,None,None]

list_strings = list(df["String"])

new_df = pd.DataFrame(columns=["String","Info A","Info B"])

for str_ in list_strings:
    Nones = [None]*len(list(df.columns))
    dict_dummy = dict(zip(list(df.columns),Nones)) 

    split_str = str_.split(sep=",")
    
    for splited_elm in split_str :
        if "A" in splited_elm and ("=" in splited_elm or ":" in splited_elm):
            dict_dummy["Info A"] = splited_elm
            
        elif "B" in splited_elm and ("=" in splited_elm or ":" in splited_elm):
            dict_dummy["Info B"] = splited_elm
        
        else:
            dict_dummy["String"] = splited_elm
    
    new_df=new_df.append(dict_dummy,ignore_index=True)

Output :

new_df
Out[47]: 
      String    Info A Info B
0  some text      None   None
1       text      A=70   B=80
2       text   A = 10m   B:20

This little script help you to classify your elements. You can do another processing to make your df better.