I have a pandas data frame with two columns, Query and Category. The Category column contains 121 unique categories. I need to replicate the rows of each category x times, where x is the difference between the maximum count any category has and the current count of that category, and store the result in a new data frame. Example: for the category alpha I have 5 different queries, so I need to add 319 replicated rows to make it equal to the max count of 324 (i.e. 324 - 5 = 319). Code I have as of now:
import pandas as pd
import numpy as np

df1 = pd.read_csv('Query_cat_121.csv')
print(len(df1['Category'].unique()))  # 121 unique categories

list_of_cats = []
for i in set(df1['Category']):
    list_of_cats.append(i)

list_of_differences = []
print(df1['Category'].value_counts().nlargest(1))  # shows the largest category count
max_count = 797  # based on the previous line

df2 = pd.DataFrame()
# want to append the modified dataframe here
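(Side note: instead of hard-coding max_count from the printed output, it could also be computed directly, e.g. with something like the line below on the same df1.)
max_count = df1['Category'].value_counts().max()  # 797 for this data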
A sample of the data frame I have:
Query Category
apple fruits
banana fruits
mango fruits
bat animal
cat animal
rat animal
lion animal
potato veggie
A sample of the data frame I want:
Query Category
apple fruits
banana fruits
mango fruits
apple fruits
bat animal
cat animal
rat animal
lion animal
potato veggie
potato veggie
potato veggie
potato veggie
This is the desired output: every other category is replicated as many times as needed to reach the maximum category count, which is 4 here.
CodePudding user response:
You can use pandas.DataFrame.groupby and then replicate the data in each group based on your logic, trimming every group to the size of the largest one.
import math
import pandas as pd

dict_ = {
    'Query': ['apple', 'banana', 'mango', 'bat', 'cat', 'rat', 'lion', 'potato', 'london', 'new jersey'],
    'Category': ['fruits', 'fruits', 'fruits', 'animal', 'animal', 'animal', 'animal', 'veggie', 'place', 'place'],
}
df = pd.DataFrame(dict_)
df
Query Category
0 apple fruits
1 banana fruits
2 mango fruits
3 bat animal
4 cat animal
5 rat animal
6 lion animal
7 potato veggie
8 london place
9 new jersey place
rep_val = df.groupby('Category').size().max()  # size of the largest group (4 here)
# tile each group to at least rep_val rows, then trim it to exactly rep_val rows
df.groupby('Category').apply(lambda d: pd.concat([d] * math.ceil(rep_val / d.shape[0])).head(rep_val)).reset_index(drop=True)
which gives us the expected output:
Query Category
0 bat animal
1 cat animal
2 rat animal
3 lion animal
4 apple fruits
5 banana fruits
6 mango fruits
7 apple fruits
8 london place
9 new jersey place
10 london place
11 new jersey place
12 potato veggie
13 potato veggie
14 potato veggie
15 potato veggie
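As a quick sanity check, you can assign the result to a name (balanced_df is just a hypothetical name here) and confirm that every category now has rep_val rows:
balanced_df = df.groupby('Category').apply(lambda d: pd.concat([d] * math.ceil(rep_val / d.shape[0])).head(rep_val)).reset_index(drop=True)
print(balanced_df['Category'].value_counts())  # every category appears 4 times in this sample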
CodePudding user response:
It isn't clear what you are asking, but I think you may be asking this: given a dataframe D with columns Query and Category, you want to add X new copies of every row, where X depends on the value of Category. The value of X for category q is C_M - C_q, where C_M is the cardinality of the highest-count category and C_q is the cardinality of q in the data frame.
A simple way to get started on this is:
difference_ser = df1['Category'].value_counts().max() - df1['Category'].value_counts()
difference_dict = difference_ser.to_dict()
And now you have a dictionary that tells you how many extra copies of each line to add.
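For instance, on the small sample frame from the question (fruits: 3 rows, animal: 4 rows, veggie: 1 row), a minimal sketch of what that dictionary would contain (sample and counts are hypothetical names):
import pandas as pd

sample = pd.DataFrame({
    'Query': ['apple', 'banana', 'mango', 'bat', 'cat', 'rat', 'lion', 'potato'],
    'Category': ['fruits', 'fruits', 'fruits', 'animal', 'animal', 'animal', 'animal', 'veggie'],
})
counts = sample['Category'].value_counts()
print((counts.max() - counts).to_dict())  # {'animal': 0, 'fruits': 1, 'veggie': 3}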
Then you can use repeat to create new dataframes for each category.
my_dfs_to_add = []
for c in list(df1['Category'].unique()):
    tiling_df = df1[df1['Category'] == c]
    df_to_add = pd.DataFrame({
        'Query': tiling_df['Query'].repeat(difference_dict[c]),
        'Category': tiling_df['Category'].repeat(difference_dict[c]),
    })
    my_dfs_to_add.append(df_to_add)
Then just concatenate all the dfs you made and append it to your original df.
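That last step might look like this (result_df is a hypothetical name, assuming df1 and my_dfs_to_add from above):
result_df = pd.concat([df1] + my_dfs_to_add, ignore_index=True)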