Here is a dummy DataFrame of my data. It has category rows (identified by a NaN value of 'Price') and data rows (identified by a non-NaN value of 'Price').
import pandas as pd

gear = [('Baseball', None), ('Bat', 1), ('Glove', 2), ('Soccer', None), ('Shoes', 3), ('Ball', 4), ('Football', None), ('Helmet', 6)]
dummy_df = pd.DataFrame(gear, columns=['Name', 'Price'])
       Name  Price
0  Baseball    NaN
1       Bat    1.0
2     Glove    2.0
3    Soccer    NaN
4     Shoes    3.0
5      Ball    4.0
6  Football    NaN
7    Helmet    6.0
I would like to create a new column 'Sport' that gives each data row the name of the category row above it, up to the next sport. The resulting DataFrame, with the category rows removed, would look like this:
     Name  Price     Sport
1     Bat    1.0  Baseball
2   Glove    2.0  Baseball
3   Shoes    3.0    Soccer
4    Ball    4.0    Soccer
5  Helmet    6.0  Football
I was thinking of creating a new column 'Sport' that holds the value of Name if Price is NaN, and NaN otherwise. Then use ffill or something similar, and then drop the rows with a NaN price?
CodePudding user response:
Try using mask with the notna result to blank out the names on the data rows, then ffill to get the correct Sport:
s = dummy_df['Price'].notna()
dummy_df.assign(Sport=dummy_df['Name'].mask(s).ffill()).loc[s]
Output:
     Name  Price     Sport
1     Bat    1.0  Baseball
2   Glove    2.0  Baseball
4   Shoes    3.0    Soccer
5    Ball    4.0    Soccer
7  Helmet    6.0  Football
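To see why this works, the one-liner can be broken into its intermediate steps. This is only a sketch of the same mask/ffill idea; the names is_item, headers, sport and result are made up here for illustration and are not part of the answer above.

```python
import pandas as pd

gear = [('Baseball', None), ('Bat', 1), ('Glove', 2), ('Soccer', None),
        ('Shoes', 3), ('Ball', 4), ('Football', None), ('Helmet', 6)]
dummy_df = pd.DataFrame(gear, columns=['Name', 'Price'])

# True on data rows (Price present), False on category rows (Price is NaN)
is_item = dummy_df['Price'].notna()

# Keep Name only on the category rows; data rows become NaN
headers = dummy_df['Name'].mask(is_item)

# Carry each category name down over the NaN rows below it
sport = headers.ffill()

# Attach the new column, then keep only the data rows
result = dummy_df.assign(Sport=sport).loc[is_item]
print(result)
```

Note that assign returns a new DataFrame, so dummy_df itself is left unchanged.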
CodePudding user response:
dummy_df["Sport"] = dummy_df.groupby(dummy_df.Price.isna().cumsum()).Name.transform("first")
dummy_df[dummy_df.Price.notna()]
#      Name  Price     Sport
# 1     Bat    1.0  Baseball
# 2   Glove    2.0  Baseball
# 4   Shoes    3.0    Soccer
# 5    Ball    4.0    Soccer
# 7  Helmet    6.0  Football
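The groupby version relies on the cumulative sum of the NaN flags acting as a group label: the counter goes up by one at every category row, so each category row and the data rows under it share the same number. A rough sketch that makes the label visible (group_id, sport and result are illustrative names, and it assumes the first row of the frame is a category row):

```python
import pandas as pd

gear = [('Baseball', None), ('Bat', 1), ('Glove', 2), ('Soccer', None),
        ('Shoes', 3), ('Ball', 4), ('Football', None), ('Helmet', 6)]
dummy_df = pd.DataFrame(gear, columns=['Name', 'Price'])

# Increases by 1 at each category row, so rows of one sport share a label
group_id = dummy_df['Price'].isna().cumsum()
print(group_id.tolist())

# The first Name in each group is the category row, i.e. the sport
sport = dummy_df.groupby(group_id)['Name'].transform('first')

# Add the column without modifying dummy_df, then drop the category rows
result = dummy_df.assign(Sport=sport)[dummy_df['Price'].notna()]
print(result)
```

Unlike the snippet above, this sketch uses assign instead of writing the column into dummy_df in place; either works.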