To keep my machine learning algorithm from being biased toward overrepresented values, I want to reduce the frequency differences in my dataset, which is a pandas DataFrame.
For example, in column X:
- value A appears 1500 times
- value B appears 3000 times
- value C appears 1300 times
Is there a way to keep 1250 of each?
CodePudding user response:
Can you try this? Note that pd.concat expects a single list of DataFrames, not separate arguments:
df2 = pd.concat([df[df['X']=='A'][:1250], df[df['X']=='B'][:1250], df[df['X']=='C'][:1250]])
CodePudding user response:
A solution assuming you may have an unknown number of unique values:
import pandas as pd

# Create a pandas DataFrame with the given number of elements
d = {'X': 1500*["A"] + 3000*["B"] + 1300*["C"]}
df = pd.DataFrame(data=d)

# Create a dictionary containing one DataFrame per unique value
dfDict = dict(iter(df.groupby('X')))

# Keep only the first n rows of each and concatenate them into a filtered DataFrame
for unique_val in dfDict:
    dfDict[unique_val] = dfDict[unique_val][:1250]
filtered = pd.concat(dfDict, ignore_index=True)
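If you want a random subset per group rather than just the first rows (which avoids any ordering bias in the data), DataFrame.groupby(...).sample() (available in pandas 1.1+) can downsample every unique value in one line. A minimal sketch, reusing the example data from the answer above; the column name 'X' and target count 1250 come from the question:

```python
import pandas as pd

# Example data with imbalanced value counts, as in the question
df = pd.DataFrame({'X': 1500*["A"] + 3000*["B"] + 1300*["C"]})

# Randomly keep 1250 rows per unique value of X
# (random_state makes the sample reproducible)
balanced = df.groupby('X').sample(n=1250, random_state=0).reset_index(drop=True)

print(balanced['X'].value_counts())
```

This also generalizes to an unknown number of unique values, since groupby handles every group automatically. If some group has fewer than 1250 rows, sample() will raise an error unless you pass replace=True or lower n.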