Home > Enterprise >  How to extract group categories when the category names are mixed with the data
How to extract group categories when the category names are mixed with the data

Time:12-07

I have the following data structure. In which the category names are mixed with the product names

df = pd.DataFrame(data={'name':['Category A', 'Subcategory A.A', 'Product A', 'Product B', 'Category B', 'Product C'],'values':["", "", 1,2,"", 3]})

name                values
Category A  
Subcategory A.A 
Product A           1
Product B           2
Category B  
Product C           3

Every entry in the name column which does not have a value is a category name.

Is there any way to convert the pandas DataFrame into the following structure?

name        values      category
Product A   1           Category A, Subcategory A.A
Product B   2           Category A, Subcategory A.A
Product C   3           Category B

Any help is appreciated.

CodePudding user response:

Use cumsum to create a custom grouping for the category blocks and use groupby.apply to return the non-category rows new category column:

# create custom grouping per category block
newgroup = df['values'].eq('') & df['values'].shift().ne('')
groups = newgroup.cumsum()

# given group g, return subframe of non-category rows   category name
def categorize(g):
    is_category = g['values'].eq('')
    category = ', '.join(g.loc[is_category, 'name']) # join category names by comma
    return g.loc[~is_category].assign(category=category) # return non-category rows with new category column

# apply custom function to each group
df.groupby(groups).apply(categorize).droplevel(0)

Output:

        name values                     category
2  Product A      1  Category A, Subcategory A.A
3  Product B      2  Category A, Subcategory A.A
5  Product C      3                   Category B

Details

  1. Each category block starts when the current values is empty and previous values is not, so we can generate pseudo-groups using cumsum. Here the groups are shown as a column just for visual reference:

    newgroup = df['values'].eq('') & df['values'].shift().ne('')
    groups = newgroup.cumsum()
    
    #               name  values  groups
    # 0       Category A               1
    # 1  Subcategory A.A               1
    # 2        Product A       1       1
    # 3        Product B       2       1
    # 4       Category B               2
    # 5        Product C       3       2
    
  2. Within each group, get the category string by joining the name from all the category rows. Then we can return the non-category rows after we assign the new category name:

    def categorize(g):
        is_category = g['values'].eq('')
        category = ', '.join(g.loc[is_category, 'name']) # join category rows by comma
        return g.loc[~is_category].assign(category=category) # return non-category rows with new category column
    
  3. Pass this function to groupby.apply:

    df.groupby(groups).apply(categorize).droplevel(0)
    
    #         name values                     category
    # 2  Product A      1  Category A, Subcategory A.A
    # 3  Product B      2  Category A, Subcategory A.A
    # 5  Product C      3                   Category B
    

CodePudding user response:

I'm not sure if there's a very pandas-esque way to deal with this, so I've come up with a simple Python solution:

new_data = {'name':[], 'values': [], 'category': []}
lasts = {}
for idx, row in df.iterrows():
    tp, val = row['name'].split(' ')
    if row['values'] == '':
        lasts[tp] = val
        # Reset the subcategory if a new category is encountered
        if tp == 'Category' and 'Subcategory' in lasts:
            del lasts['Subcategory']
    else:
        new_data['category'].append(', '.join(f'{k} {v}' for k, v in zip(lasts.keys(), lasts.values())))
        for k in row.index:
            new_data[k].append(row[k])
df = pd.DataFrame(new_data)

Output:

>>> df
        name  values                     category
0  Product A       1  Category A, Subcategory A.A
1  Product B       2  Category A, Subcategory A.A
2  Product C       3                   Category B
  • Related