Filling rows based on column partitions & conditions in Python-CodePudding

My goal is to iterate through a list of possible B values, such that each ID (col A) will have new rows added with C = 0 where the possible B value did not previously exist in the DF.

I have a dataframe with:

    A    B   C
0   id1  2   10
1   id1  3   20
2   id2  1   30

possible_B_values = [1 2 3]

Resulting in:

    A    B   C
0   id1  1   0
1   id1  2   10
2   id1  3   20
3   id2  1   30
4   id2  2   0
5   id2  3   0

Thanks in advance!

CodePudding user response：

Using some index trickery:

import pandas as pd

df = pd.read_clipboard() # Your df here
possible_B_values = [1, 2, 3]

extrapolate_columns = ["A", "B"]
index = pd.MultiIndex.from_product(
    [df["A"].unique(), possible_B_values],
    names=extrapolate_columns
)

out = df.set_index(extrapolate_columns).reindex(index, fill_value=0).reset_index()

out:

     A  B   C
0  id1  1   0
1  id1  2  10
2  id1  3  20
3  id2  1  30
4  id2  2   0
5  id2  3   0

CodePudding user response：

Maybe you can create a dataframe with list of tuples with possible B values and merge it with original one

import pandas as pd

# Create a list of tuples with the possible B values and a C value of 0
possible_b_values = [1, 2, 3]
possible_b_rows = [(id, b, 0) for id in df['A'].unique() for b in possible_b_values]

# Create a new DataFrame from the list of tuples
possible_b_df = pd.DataFrame(possible_b_rows, columns=['A', 'B', 'C'])

# Merge the new DataFrame with the original one, using the 'A' and 'B' columns as the keys
df = df.merge(possible_b_df, on=['A', 'B'], how='outer')

# Fill any null values in the 'C' column with 0
df['C'] = df['C'].fillna(0)

print(df)

CodePudding user response：

Here is a one-liner pure pandas way of solving this -

You set the index as B (this will help in re-indexing later)
Gropuby column A and then for column C apply the following apply function to reindex B
The lambda function x.reindex(range(1,4), fill_value=0) basically takes each group of dataframe x for each id, and then reindexes it from range(1,4) = 1,2,3 and fills the nan values with 0.
Finally you reset_index to bring A and B back into the dataframe.

out = df.set_index('B') \                                       # Set index as B
        .groupby(['A'])['C'] \                                  # Groupby A and use apply on column C
        .apply(lambda x: x.reindex(range(1,4), fill_value=0))\  # Reindex B to range(1,4) for each group and fill 0
        .reset_index()                                          # Reset index

print(out)

     A  B   C
0  id1  1   0
1  id1  2  10
2  id1  3  20
3  id2  1  30
4  id2  2   0
5  id2  3   0