I have a DataFrame that contains objects and items belonging to the objects. Items have additional data (not shown) and multiple items can belong to one object.
import pandas as pd

df = pd.DataFrame(
    {
        "object_id": [1, 1, 1, 1, 1, 2, 2, 2],
        "item_id": [1, 2, 4, 4, 5, 1, 1, 2],
        "item_count": [6, 6, 6, 6, 6, 3, 3, 3],
    }
)
I now want to group by the object_id and extract information from the associated items. While the following works, it does not add items that are not already in the DataFrame (i.e. rows with a size of 0):
df_group = df.groupby(["object_id", "item_id"], as_index=False).size()
>>> df_group
   object_id  item_id  size
0          1        1     1
1          1        2     1
# e.g. item 3 missing
2          1        4     2
3          1        5     1
4          2        1     2
5          2        2     1
I now wanted to find out if there is a way to expand the groupby using the item_count. My current naive approach is to build a list of all expected rows and merge the groupby result into it afterwards:
all_items = [
    dict(object_id=entity, item_id=obj + 1)
    for entity in df["object_id"].unique()
    for obj in range(df.loc[df["object_id"] == entity, "item_count"].iloc[0])
]
df_full = (
    pd.DataFrame(all_items)
    .merge(df_group, how="left")
    .fillna(0)
    .astype({"size": "int"})
)
>>> df_full
   object_id  item_id  size
0          1        1     1
1          1        2     1
2          1        3     0
3          1        4     2
4          1        5     1
5          1        6     0
6          2        1     2
7          2        2     1
8          2        3     0
CodePudding user response:
Here is one way to do it:
# summarize the duplicate item ids and create a new df;
# this is needed at this stage so that reindex can be used later
df2 = df.groupby(['object_id', 'item_id', 'item_count'], as_index=False).size()

# group by object_id, then apply a lambda on each group:
# set item_id as the index, which is now unique, and reindex the group
# with item ids ranging from the minimum item_id up to the value in the
# item_count column (range end: item_count + 1)
df3 = (df2.groupby('object_id', as_index=False)
          .apply(lambda x: x.set_index(['item_id'])
                            .reindex(range(x['item_id'].min(),
                                           x['item_count'].max() + 1))))

# null values in size: make them 0
df3['size'] = df3['size'].fillna(0)

# ffill the null values in the remaining columns
df3 = df3.ffill()

# drop the unwanted helper level created by the groupby/apply
df3 = df3.reset_index().drop(columns='level_0')

# NaN makes the column values float, so turn them back to int
df3 = df3[['object_id', 'item_id', 'item_count', 'size']].astype(int)
df3
   object_id  item_id  item_count  size
0          1        1           6     1
1          1        2           6     1
2          1        3           6     0
3          1        4           6     2
4          1        5           6     1
5          1        6           6     0
6          2        1           3     2
7          2        2           3     1
8          2        3           3     0
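As a quick sanity check, the result should line up with the df_full frame built in the question (a sketch, assuming both frames are in scope):

# hypothetical check: the reindex approach agrees with the asker's merge approach
assert df3[["object_id", "item_id", "size"]].equals(
    df_full[["object_id", "item_id", "size"]]
)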
CodePudding user response:
After you get the df_group, you can reindex that group, constructing the index yourself. The index is built by taking all values of object_id and, for each of them, all values of item_id starting from 1 up to the maximum value of item_count for that object_id.
df.groupby(["object_id", "item_id"]).size().reindex(
pd.MultiIndex.from_tuples(
tuples=[
(object_id, team_id)
for object_id in df["object_id"].unique()
for team_id in range(
1, df[df["object_id"] == object_id]["item_count"].max() 1
)
],
names=["object_id", "team_id"],
),
fill_value=0,
)
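For the sample data this yields the same counts as df_full in the question, just as a Series indexed by (object_id, item_id). If you want the flat frame back, a reset_index with a named size column should do it (a sketch, reusing the same comprehension for the index):

full_index = pd.MultiIndex.from_tuples(
    tuples=[
        (object_id, item_id)
        for object_id in df["object_id"].unique()
        for item_id in range(
            1, df[df["object_id"] == object_id]["item_count"].max() + 1
        )
    ],
    names=["object_id", "item_id"],
)

# same reindex as above, then back to a flat frame with a named size column
df_full = (
    df.groupby(["object_id", "item_id"])
      .size()
      .reindex(full_index, fill_value=0)
      .reset_index(name="size")
)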
CodePudding user response:
Try this to fill the missing values:
# note: reindex requires a unique index, so duplicate item_ids within an
# object (as in the sample df) have to be aggregated away first
df = (df.set_index('item_id')
        .groupby('object_id')['item_count']
        .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1),
                                   fill_value=0))
        .reset_index()
     )
Then do the groupby you need on the new df.
Check the other solutions if I missed a detail.
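For reference, here is a sketch of what that could look like when the same idea is applied to df_group from the question instead, since df_group already has one row per (object_id, item_id) and therefore a unique index per object. Note that this only fills the gaps between each object's existing minimum and maximum item_id; extending out to item_count would still need the count column, as in the other answers.

# hypothetical variant: reindex the aggregated df_group instead of df,
# so each object's item_id index is unique and the missing ids get size 0
df_filled = (
    df_group.set_index("item_id")
            .groupby("object_id")["size"]
            .apply(lambda s: s.reindex(range(s.index.min(), s.index.max() + 1),
                                       fill_value=0))
            .reset_index()
)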