I have a DataFrame that contains objects and items belonging to the objects. Items have additional data (not shown) and multiple items can belong to one object.
import pandas as pd

df = pd.DataFrame(
    {
        "object_id": [1, 1, 1, 1, 1, 2, 2, 2],
        "item_id": [1, 2, 4, 4, 5, 1, 1, 2],
        "item_count": [6, 6, 6, 6, 6, 3, 3, 3],
    }
)
I now want to group by the object_id and extract information from the associated items. While the following works, it does not add items that are not already in the DataFrame (i.e. rows with a size of 0):
df_group = df.groupby(["object_id", "item_id"], as_index=False).size()
>>> df_group
   object_id  item_id  size
0          1        1     1
1          1        2     1
# e.g. item 3 missing
2          1        4     2
3          1        5     1
4          2        1     2
5          2        2     1
I now wanted to find out if there is a way to expand the groupby using the item_count. My current naive approach is to build a list of all expected rows and merge the groupby result into it afterwards:
all_items = [
    dict(object_id=entity, item_id=obj + 1)
    for entity in df["object_id"].unique()
    for obj in range(df.loc[df["object_id"] == entity, "item_count"].iloc[0])
]
df_full = (
    pd.DataFrame(all_items)
    .merge(df_group, how="left")
    .fillna(0)
    .astype({"size": "int"})
)
>>> df_full
   object_id  item_id  size
0          1        1     1
1          1        2     1
2          1        3     0
3          1        4     2
4          1        5     1
5          1        6     0
6          2        1     2
7          2        2     1
8          2        3     0
CodePudding user response:
Here is one way to do it:
# summarize the duplicate item ids and create a new df;
# this is needed at this stage so that reindex can be used later
df2 = df.groupby(['object_id', 'item_id', 'item_count'], as_index=False).size()

# group by object_id, then apply a lambda on each group:
# set item_id as the index, which is now unique, and reindex the group
# with item ids ranging from the minimum item_id up to the value in the
# item_count column (range end: item_count + 1)
df3 = (df2.groupby('object_id', as_index=False)
          .apply(lambda x: x.set_index(['item_id'])
                            .reindex(range(x['item_id'].min(),
                                           x['item_count'].max() + 1))))

# null values in size: make them 0
df3['size'] = df3['size'].fillna(0)

# ffill the null values in the remaining columns
df3 = df3.ffill()

# drop the unwanted helper level created by the groupby/apply
df3 = df3.reset_index().drop(columns='level_0')

# NaN makes the column values float, so turn them back to int
df3 = df3[['object_id', 'item_id', 'item_count', 'size']].astype(int)
df3
   object_id  item_id  item_count  size
0          1        1           6     1
1          1        2           6     1
2          1        3           6     0
3          1        4           6     2
4          1        5           6     1
5          1        6           6     0
6          2        1           3     2
7          2        2           3     1
8          2        3           3     0
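As a quick sanity check, the result should line up with the df_full frame built in the question (a sketch, assuming both frames are in scope):

# hypothetical check: the reindex approach agrees with the asker's merge approach
assert df3[["object_id", "item_id", "size"]].equals(
    df_full[["object_id", "item_id", "size"]]
)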
CodePudding user response:
After you get the df_group, you can reindex that group, constructing the index yourself. The index is built by taking all values of object_id and, for each of them, all values of item_id starting from 1 up to the maximum value of item_count for that object_id.
df.groupby(["object_id", "item_id"]).size().reindex(
pd.MultiIndex.from_tuples(
tuples=[
(object_id, team_id)
for object_id in df["object_id"].unique()
for team_id in range(
1, df[df["object_id"] == object_id]["item_count"].max() 1
)
],
names=["object_id", "team_id"],
),
fill_value=0,
)
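For the sample data this yields the same counts as df_full in the question, just as a Series indexed by (object_id, item_id). If you want the flat frame back, a reset_index with a named size column should do it (a sketch, reusing the same comprehension for the index):

full_index = pd.MultiIndex.from_tuples(
    tuples=[
        (object_id, item_id)
        for object_id in df["object_id"].unique()
        for item_id in range(
            1, df[df["object_id"] == object_id]["item_count"].max() + 1
        )
    ],
    names=["object_id", "item_id"],
)

# same reindex as above, then back to a flat frame with a named size column
df_full = (
    df.groupby(["object_id", "item_id"])
      .size()
      .reindex(full_index, fill_value=0)
      .reset_index(name="size")
)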
CodePudding user response:
Try this to fill the missing values:
# note: reindex requires a unique index, so duplicate item_ids within an
# object (as in the sample df) have to be aggregated away first
df = (df.set_index('item_id')
        .groupby('object_id')['item_count']
        .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1),
                                   fill_value=0))
        .reset_index()
     )
Then do the groupby you need on the new df.
Check the other solutions if I missed a detail.
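For reference, here is a sketch of what that could look like when the same idea is applied to df_group from the question instead, since df_group already has one row per (object_id, item_id) and therefore a unique index per object. Note that this only fills the gaps between each object's existing minimum and maximum item_id; extending out to item_count would still need the count column, as in the other answers.

# hypothetical variant: reindex the aggregated df_group instead of df,
# so each object's item_id index is unique and the missing ids get size 0
df_filled = (
    df_group.set_index("item_id")
            .groupby("object_id")["size"]
            .apply(lambda s: s.reindex(range(s.index.min(), s.index.max() + 1),
                                       fill_value=0))
            .reset_index()
)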