To keep my machine learning algorithm from being biased toward overrepresented values, I want to reduce the frequency differences in my dataset, which is a pandas DataFrame.
For example, in column X:
- value A appears 1500 times
- value B appears 3000 times
- value C appears 1300 times
Is there a way to keep 1250 of each?
CodePudding user response:
Can you try this? Note that pd.concat expects a single list of DataFrames, not separate arguments:
df2 = pd.concat([df[df['X']=='A'][:1250], df[df['X']=='B'][:1250], df[df['X']=='C'][:1250]])
CodePudding user response:
A solution assuming you may have an unknown number of unique values:
import pandas as pd

# Create a pandas DataFrame with the given number of elements
d = {'X': 1500*["A"] + 3000*["B"] + 1300*["C"]}
df = pd.DataFrame(data=d)

# Create a dictionary containing one DataFrame per unique value
dfDict = dict(iter(df.groupby('X')))

# Keep only the first n rows of each and concatenate them into a filtered DataFrame
for unique_val in dfDict:
    dfDict[unique_val] = dfDict[unique_val][:1250]
filtered = pd.concat(dfDict, ignore_index=True)
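If you want a random subset per group rather than just the first rows (which avoids any ordering bias in the data), DataFrame.groupby(...).sample() (available in pandas 1.1+) can downsample every unique value in one line. A minimal sketch, reusing the example data from the answer above; the column name 'X' and target count 1250 come from the question:

```python
import pandas as pd

# Example data with imbalanced value counts, as in the question
df = pd.DataFrame({'X': 1500*["A"] + 3000*["B"] + 1300*["C"]})

# Randomly keep 1250 rows per unique value of X
# (random_state makes the sample reproducible)
balanced = df.groupby('X').sample(n=1250, random_state=0).reset_index(drop=True)

print(balanced['X'].value_counts())
```

This also generalizes to an unknown number of unique values, since groupby handles every group automatically. If some group has fewer than 1250 rows, sample() will raise an error unless you pass replace=True or lower n.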