Pull out categorical rows and apply to all subsequent rows until new category found

Here is a dummy DataFrame of my data. Categorical rows are indicated by a NaN value in 'Price', and data rows by a non-NaN 'Price'.

import pandas as pd

gear = [('Baseball', None), ('Bat', 1), ('Glove', 2), ('Soccer', None), ('Shoes', 3), ('Ball', 4), ('Football', None), ('Helmet', 6)]
dummy_df = pd.DataFrame(gear, columns=['Name', 'Price'])

       Name  Price
0  Baseball    NaN
1       Bat    1.0
2     Glove    2.0
3    Soccer    NaN
4     Shoes    3.0
5      Ball    4.0
6  Football    NaN
7    Helmet    6.0

I would like to create a new column 'Sport' that applies the category to every row below it until the next sport is reached. The resulting DataFrame would look like this, with the categorical rows removed:

       Name  Price     Sport
1       Bat    1.0  Baseball
2     Glove    2.0  Baseball
3     Shoes    3.0    Soccer
4      Ball    4.0    Soccer
5    Helmet    6.0  Football

I was thinking of creating a new column 'Sport' that holds the value of Name where Price is NaN, and NaN otherwise. Then use ffill or something and drop the NaN-price rows?
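
Roughly the sketch I have in mind (untested, assuming where/ffill do what I think they do):

dummy_df['Sport'] = dummy_df['Name'].where(dummy_df['Price'].isna()).ffill()  # sport names only on category rows, carried down
result = dummy_df[dummy_df['Price'].notna()]  # drop the category rows themselves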

CodePudding user response:

Try masking Name where Price is notna, then ffill to get the correct Sport:

s = dummy_df['Price'].notna()  # True on data rows, False on category rows
dummy_df.assign(Sport=dummy_df['Name'].mask(s).ffill()).loc[s]

Output:

     Name  Price     Sport
1     Bat    1.0  Baseball
2   Glove    2.0  Baseball
4   Shoes    3.0    Soccer
5    Ball    4.0    Soccer
7  Helmet    6.0  Football
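
Here mask(s) blanks out Name on the data rows, so only the sport names survive; ffill then carries each sport name down to the rows below it, and .loc[s] keeps just the data rows. As a quick sanity check (same dummy_df as above), the intermediate column before filtering should look like this:

dummy_df['Name'].mask(s).ffill()
# 0    Baseball
# 1    Baseball
# 2    Baseball
# 3      Soccer
# 4      Soccer
# 5      Soccer
# 6    Football
# 7    Football
# Name: Name, dtype: object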

CodePudding user response:

dummy_df["Sport"] = dummy_df.groupby(dummy_df.Price.isna().cumsum()).Name.transform("first")
dummy_df[dummy_df.Price.notna()]
#      Name  Price     Sport
# 1     Bat    1.0  Baseball
# 2   Glove    2.0  Baseball
# 4   Shoes    3.0    Soccer
# 5    Ball    4.0    Soccer
# 7  Helmet    6.0  Football
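
The grouping key is just the running count of NaN prices, so each category row opens a new group covering itself and the data rows beneath it; transform("first") then broadcasts the first Name in each group (the sport) to every row. As a quick check on the same dummy_df, the key should look like this:

dummy_df.Price.isna().cumsum()
# 0    1
# 1    1
# 2    1
# 3    2
# 4    2
# 5    2
# 6    3
# 7    3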