I am trying to convert a single column, extra, into three new columns based on its string value, which is formatted as <column name>: <column value(s)>, ..., <column name>: <column value(s)>, where column name is the name of the new column and column value(s) can be an arbitrary value such as a list, float, or string.
I am working with the following dataframe:
import pandas as pd

df = pd.DataFrame(
    {
        "subject": [1, 1],
        "extra": [
            "category: app, datasets: [\"X\", \"Y\"], acc: [0.8, 0.9]",
            "category: dev, datasets: [\"Z\", \"Y\"], acc: [0.7, 0.95]",
        ],
    }
)
Desired output:
   subject category datasets          acc
0        1      app   [X, Y]   [0.8, 0.9]
1        1      dev   [Z, Y]  [0.7, 0.95]
Then df.explode(["acc", "datasets"]) will give the final desired result:
   subject category datasets   acc
0        1      app        X   0.8
0        1      app        Y   0.9
1        1      dev        Z   0.7
1        1      dev        Y  0.95
Answer:
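You can use pyyaml to parse each string once the commas that separate the fields are turned into newlines:
import re
import yaml
# Replace each ", <name>:" separator with a newline so the string parses as a YAML mapping
extracted_df = pd.json_normalize(df['extra'].apply(lambda x: yaml.load(re.sub(r',\s*(\w+:)', '\n\\1', x), Loader=yaml.SafeLoader)))
new_df = pd.concat([df.drop('extra', axis=1), extracted_df], axis=1)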
Output:
>>> new_df
   subject category datasets          acc
0        1      app   [X, Y]   [0.8, 0.9]
1        1      dev   [Z, Y]  [0.7, 0.95]
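For reference, here is a minimal end-to-end sketch of the same idea, including the explode step from the question. It assumes pyyaml is installed, that every field name matches \w+, and that multi-column explode is available (pandas 1.3 or later). The helper name parse_extra is made up for this example, and it builds the frame with pd.DataFrame rather than pd.json_normalize, which should give the same columns for flat dictionaries like these.

```python
import re

import pandas as pd
import yaml

df = pd.DataFrame(
    {
        "subject": [1, 1],
        "extra": [
            'category: app, datasets: ["X", "Y"], acc: [0.8, 0.9]',
            'category: dev, datasets: ["Z", "Y"], acc: [0.7, 0.95]',
        ],
    }
)


def parse_extra(text):
    """Parse one 'name: value, name: value' string into a dict (hypothetical helper)."""
    # Only commas followed by "<word>:" are field separators; commas inside
    # the [...] lists are not followed by "<word>:" and are left untouched.
    as_yaml = re.sub(r",\s*(\w+:)", r"\n\1", text)
    return yaml.safe_load(as_yaml)


# One dict per row, turned into columns and aligned on the original index.
parsed = [parse_extra(s) for s in df["extra"]]
extracted_df = pd.DataFrame(parsed, index=df.index)
new_df = pd.concat([df.drop(columns="extra"), extracted_df], axis=1)

# Exploding several columns at once requires the lists in each row
# to have the same length.
final_df = new_df.explode(["datasets", "acc"])
print(final_df)
```

If a value could itself contain text that looks like ", name:", the regex would split in the wrong place, so this only suits inputs shaped like the ones in the question.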