I have a dataframe like the one that follows:
import pandas as pd
import numpy as np
filename = [
"f1",
"f1",
"f1",
"f1",
"f2",
"f2",
"f2",
"f2",
"f2",
"f2"
]
parts = ["p1", "p1", "p2", "p2", "p1", "p1", "p2", "p2", "p3", "p3"]
groups = ["g1", "g2"] * 5
values = [1, 2, 3, 4, 10, 20, 30, 40, 50, 60]
values2 = [i 1 for i in values]
df = pd.DataFrame.from_records({"filename": filename, "part":parts, "f1": values, "group": groups, "f2":values2})
df = df.set_index(["filename", "part"])
print(df)
f1 f2 group
filename part
f1 p1 1 2 g1
p1 2 3 g2
p2 3 4 g1
p2 4 5 g2
f2 p1 10 11 g1
p1 20 21 g2
p2 30 31 g1
p2 40 41 g2
p3 50 51 g1
p3 60 61 g2
Each file is subdivided into more parts, in this case two parts for f1
and three for f2
. For each part, values f1
and f2
are extracted for a given number of features group
.
I would like to obtain a new dataframe with:
filename
, part
, f1_g1
, f2_g2
i.e. I would like to remove the column group
and move its values (g1
and g2
) in the column names.
The solution I have implemented is:
import pandas as pd
import numpy as np
filename = [
"f1",
"f1",
"f1",
"f1",
"f2",
"f2",
"f2",
"f2",
"f2",
"f2"
]
parts = ["p1", "p1", "p2", "p2", "p1", "p1", "p2", "p2", "p3", "p3"]
groups = ["g1", "g2"] * 5
values = [1, 2, 3, 4, 10, 20, 30, 40, 50, 60]
values2 = [i 1 for i in values]
df = pd.DataFrame.from_records({"filename": filename, "part":parts, "f1": values, "group": groups, "f2":values2})
df = df.set_index(["filename", "part"])
print(df)
df = df.reset_index()
groups = df.group.unique()
cols = [c for c in df.columns if (c not in ['filename', 'part', 'group'])]
records = []
for fn in df.filename.unique():
parts = df.loc[df.filename == fn, "part"].unique()
for p in parts:
d = dict()
for g in groups:
d['filename'] = fn
d['part'] = p
for c in cols:
value = df.loc[(df.filename == fn) & (df.part == p) & (df.group==g), c].iloc[0]
d[f"{c}_{g}"] = value
records.append(d)
df2 = pd.DataFrame.from_records(records).set_index(["filename", "part"])
print(df2)
that builds a DataFrame df2
:
f1_g1 f2_g1 f1_g2 f2_g2
filename part
f1 p1 1 2 2 3
p2 3 4 4 5
f2 p1 10 11 20 21
p2 30 31 40 41
p3 50 51 60 61
My actual data has plenty more columns. This solution works, but I do not think it is elegant and the best one might think of.
Is there a better way to do it?
CodePudding user response:
you get most of the way there by adding 'group' to the index and then unstacking:
df2 = df.reset_index().set_index(["filename", "part",'group']).unstack()
To get the column labels into the shape you want, you can do
df2.columns = ['_'.join(c) for c in df2.columns.to_flat_index()]
output:
f1_g1 f1_g2 f2_g1 f2_g2
filename part
f1 p1 1 2 2 3
p2 3 4 4 5
f2 p1 10 20 11 21
p2 30 40 31 41
p3 50 60 51 61