I have the following data structure, in which the category names are mixed with the product names:
df = pd.DataFrame(data={'name':['Category A', 'Subcategory A.A', 'Product A', 'Product B', 'Category B', 'Product C'],'values':["", "", 1,2,"", 3]})
            name values
      Category A
 Subcategory A.A
       Product A      1
       Product B      2
      Category B
       Product C      3
Every entry in the name
column which does not have a value is a category name.
Is there any way to convert the pandas DataFrame into the following structure?
     name values                     category
Product A      1  Category A, Subcategory A.A
Product B      2  Category A, Subcategory A.A
Product C      3                   Category B
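To make the target concrete, here is a rough hand-built sketch of the expected result (the column names just mirror the table above):
expected = pd.DataFrame({'name': ['Product A', 'Product B', 'Product C'],
                         'values': [1, 2, 3],
                         'category': ['Category A, Subcategory A.A',
                                      'Category A, Subcategory A.A',
                                      'Category B']})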
Any help is appreciated.
CodePudding user response:
Use cumsum
to create a custom grouping for the category blocks, then use groupby.apply
to return the non-category rows with a new category column:
# create custom grouping per category block
newgroup = df['values'].eq('') & df['values'].shift().ne('')
groups = newgroup.cumsum()

# given group g, return the subframe of non-category rows with the category name
def categorize(g):
    is_category = g['values'].eq('')
    category = ', '.join(g.loc[is_category, 'name'])       # join category names by comma
    return g.loc[~is_category].assign(category=category)   # return non-category rows with new category column

# apply custom function to each group
df.groupby(groups).apply(categorize).droplevel(0)
Output:
        name values                     category
2  Product A      1  Category A, Subcategory A.A
3  Product B      2  Category A, Subcategory A.A
5  Product C      3                   Category B
Details
Each category block starts when the current values
is empty and the previous values
is not, so we can generate pseudo-groups using cumsum.
Here the groups are shown as a column just for visual reference:
newgroup = df['values'].eq('') & df['values'].shift().ne('')
groups = newgroup.cumsum()

#               name values  groups
# 0       Category A              1
# 1  Subcategory A.A              1
# 2        Product A      1       1
# 3        Product B      2       1
# 4       Category B              2
# 5        Product C      3       2
Within each group, get the category string by joining the name
from all the category rows. Then we can return the non-category rows after we assign
the new category name:
def categorize(g):
    is_category = g['values'].eq('')
    category = ', '.join(g.loc[is_category, 'name'])       # join category rows by comma
    return g.loc[~is_category].assign(category=category)   # return non-category rows with new category column
Pass this function to groupby.apply:
df.groupby(groups).apply(categorize).droplevel(0)

#         name values                     category
# 2  Product A      1  Category A, Subcategory A.A
# 3  Product B      2  Category A, Subcategory A.A
# 5  Product C      3                   Category B
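A minimal alternative sketch, if you'd rather avoid groupby.apply: mask the category rows, build the block labels with the same cumsum trick, and broadcast the joined names with groupby.transform (the variable names here are just illustrative):
import pandas as pd

df = pd.DataFrame({'name': ['Category A', 'Subcategory A.A', 'Product A',
                            'Product B', 'Category B', 'Product C'],
                   'values': ["", "", 1, 2, "", 3]})

is_category = df['values'].eq('')                        # category rows have no value
groups = (is_category & ~is_category.shift(fill_value=False)).cumsum()

# broadcast the joined category names to every row of their block
category = (df['name']
            .where(is_category)                          # keep only category names
            .groupby(groups)
            .transform(lambda s: ', '.join(s.dropna())))

out = df.loc[~is_category].assign(category=category)
#         name values                     category
# 2  Product A      1  Category A, Subcategory A.A
# 3  Product B      2  Category A, Subcategory A.A
# 5  Product C      3                   Category B
This keeps the original index, matching the apply-based output above.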
CodePudding user response:
I'm not sure if there's a very pandas-esque way to deal with this, so I've come up with a simple Python solution:
new_data = {'name': [], 'values': [], 'category': []}
lasts = {}
for idx, row in df.iterrows():
    tp, val = row['name'].split(' ')
    if row['values'] == '':
        lasts[tp] = val
        # Reset the subcategory if a new category is encountered
        if tp == 'Category' and 'Subcategory' in lasts:
            del lasts['Subcategory']
    else:
        new_data['category'].append(', '.join(f'{k} {v}' for k, v in lasts.items()))
        for k in row.index:
            new_data[k].append(row[k])
df = pd.DataFrame(new_data)
Output:
>>> df
        name values                     category
0  Product A      1  Category A, Subcategory A.A
1  Product B      2  Category A, Subcategory A.A
2  Product C      3                   Category B
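As a quick sanity check, assuming the hand-built expected frame sketched in the question is available, the loop result can be compared with pandas' testing helper:
# compare against the hand-built `expected` frame from the question sketch;
# check_dtype=False keeps the comparison lenient about the 'values' dtype
pd.testing.assert_frame_equal(df, expected, check_dtype=False)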