I have a dataframe where some of the informations are on the wrong field and I need to change it to the wright column. The problem is that those informations are user inserted, so they are in the middle of strings and in differents ways. The df below is a small example of the problem:
| String | Info A | Info B |
|-------------------------|------------------|------------------|
| 'some text' | 50 | 60 |
| 'A=70, B=80, text' | | |
| 'text, A = 10m, B:20' | | |
The actual version of the df has 10 variables that I need to change and about 2mi rows.
What I need to do is put those informations on the right field, as shown on the first row of the example.
I tried some things but they all had a errors or would take to much time. If someone could help me think of a solution, I would really appreciate.
CodePudding user response:
You can use a regex with str.extractall
to get the variable names and values, then pivot
and update
:
variables = ['A', 'B']
regex = fr'\b({"|".join(variables)})\s*[=:]\s*(\d )'
# '\\b(A|B)\\s*[=:]\\s*(\\d )'
df.update(df['String']
.str.extractall(regex)
.reset_index(level=0)
.pivot(index='level_0', columns=0, values=1)
.add_prefix('Info ')
)
output:
String Info A Info B
0 some text 50.0 60.0
1 A=70, B=80, text 70 80
2 text, A = 10m, B:20 10 20
CodePudding user response:
Here is a simple code that can help you. For your example, you can do :
import pandas as pd
df = pd.DataFrame(columns=["String","Info A","Info B"])
df["String"]=['some text','A=70, B=80, text', 'text, A = 10m, B:20']
df["Info A"]=[50,None,None]
df["Info B"]=[60,None,None]
list_strings = list(df["String"])
new_df = pd.DataFrame(columns=["String","Info A","Info B"])
for str_ in list_strings:
Nones = [None]*len(list(df.columns))
dict_dummy = dict(zip(list(df.columns),Nones))
split_str = str_.split(sep=",")
for splited_elm in split_str :
if "A" in splited_elm and ("=" in splited_elm or ":" in splited_elm):
dict_dummy["Info A"] = splited_elm
elif "B" in splited_elm and ("=" in splited_elm or ":" in splited_elm):
dict_dummy["Info B"] = splited_elm
else:
dict_dummy["String"] = splited_elm
new_df=new_df.append(dict_dummy,ignore_index=True)
Output :
new_df
Out[47]:
String Info A Info B
0 some text None None
1 text A=70 B=80
2 text A = 10m B:20
This little script help you to classify your elements. You can do another processing to make your df better.