I have a pandas data frame with two columns, Query and Category. The Category column contains 121 unique categories. I need to replicate the rows of each category x times, where x is the difference between the maximum count any category has and the current count of that category, and store the result in a new data frame. Example: for the category alpha I have 5 different queries, so I need to add 319 replicated rows to make it equal to the max count of 324 (i.e. 324 - 5 = 319). Code I have as of now:
import pandas as pd
import numpy as np

df1 = pd.read_csv('Query_cat_121.csv')
print(len(df1['Category'].unique()))  # 121 unique categories

list_of_cats = []
for i in set(df1['Category']):
    list_of_cats.append(i)

list_of_differences = []
print(df1['Category'].value_counts().nlargest(1))  # shows the largest category count
max_count = 797  # based on the previous line

df2 = pd.DataFrame()
# want to append the modified dataframe here
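(Side note: instead of hard-coding max_count from the printed output, it could also be computed directly, e.g. with something like the line below on the same df1.)
max_count = df1['Category'].value_counts().max()  # 797 for this data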
A sample of the data frame I have:
Query Category
apple fruits
banana fruits
mango fruits
bat animal
cat animal
rat animal
lion animal
potato veggie
A sample of the data frame I want:
Query Category
apple fruits
banana fruits
mango fruits
apple fruits
bat animal
cat animal
rat animal
lion animal
potato veggie
potato veggie
potato veggie
potato veggie
This is the desired output: every other category is replicated as many times as needed to reach the maximum category count, which is 4 here.
CodePudding user response:
You can use pandas.DataFrame.groupby and then replicate the data in each group based on your logic, trimming every group to the size of the largest one.
import math
import pandas as pd

dict_ = {
    'Query': ['apple', 'banana', 'mango', 'bat', 'cat', 'rat', 'lion', 'potato', 'london', 'new jersey'],
    'Category': ['fruits', 'fruits', 'fruits', 'animal', 'animal', 'animal', 'animal', 'veggie', 'place', 'place'],
}
df = pd.DataFrame(dict_)
df
Query Category
0 apple fruits
1 banana fruits
2 mango fruits
3 bat animal
4 cat animal
5 rat animal
6 lion animal
7 potato veggie
8 london place
9 new jersey place
rep_val = df.groupby('Category').size().max()  # size of the largest group (4 here)
# tile each group to at least rep_val rows, then trim it to exactly rep_val rows
df.groupby('Category').apply(lambda d: pd.concat([d] * math.ceil(rep_val / d.shape[0])).head(rep_val)).reset_index(drop=True)
which gives us the expected output:
Query Category
0 bat animal
1 cat animal
2 rat animal
3 lion animal
4 apple fruits
5 banana fruits
6 mango fruits
7 apple fruits
8 london place
9 new jersey place
10 london place
11 new jersey place
12 potato veggie
13 potato veggie
14 potato veggie
15 potato veggie
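As a quick sanity check, you can assign the result to a name (balanced_df is just a hypothetical name here) and confirm that every category now has rep_val rows:
balanced_df = df.groupby('Category').apply(lambda d: pd.concat([d] * math.ceil(rep_val / d.shape[0])).head(rep_val)).reset_index(drop=True)
print(balanced_df['Category'].value_counts())  # every category appears 4 times in this sample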
CodePudding user response:
It isn't clear what you are asking, but I think you may be asking this: given a dataframe D with columns Query and Category, you want to add X new copies of every row, where X depends on the value of Category. The value of X for category q is C_M - C_q, where C_M is the cardinality of the highest-count category and C_q is the cardinality of q in the data frame.
A simple way to get started on this is:
difference_ser = df1['Category'].value_counts().max() - df1['Category'].value_counts()
difference_dict = difference_ser.to_dict()
And now you have a dictionary that tells you how many extra copies of each line to add.
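For instance, on the small sample frame from the question (fruits: 3 rows, animal: 4 rows, veggie: 1 row), a minimal sketch of what that dictionary would contain (sample and counts are hypothetical names):
import pandas as pd

sample = pd.DataFrame({
    'Query': ['apple', 'banana', 'mango', 'bat', 'cat', 'rat', 'lion', 'potato'],
    'Category': ['fruits', 'fruits', 'fruits', 'animal', 'animal', 'animal', 'animal', 'veggie'],
})
counts = sample['Category'].value_counts()
print((counts.max() - counts).to_dict())  # {'animal': 0, 'fruits': 1, 'veggie': 3}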
Then you can use repeat to create new dataframes for each category.
my_dfs_to_add = []
for c in list(df1['Category'].unique()):
    tiling_df = df1[df1['Category'] == c]
    df_to_add = pd.DataFrame({
        'Query': tiling_df['Query'].repeat(difference_dict[c]),
        'Category': tiling_df['Category'].repeat(difference_dict[c]),
    })
    my_dfs_to_add.append(df_to_add)
Then just concatenate all the dfs you made and append it to your original df.
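That last step might look like this (result_df is a hypothetical name, assuming df1 and my_dfs_to_add from above):
result_df = pd.concat([df1] + my_dfs_to_add, ignore_index=True)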