I have a dataframe that has an animals column with different animals (say ["cat", "dog", "lion"]) as rows and a value corresponding to each animal. There are 10 unique animals and 50 entries of each. The animals are not in any particular order.
I want to split the dataframe into two with one containing 40 of each animal and another containing 10 of each animal. That is one dataframe should contain 40 cats, 40 dogs etc and the other dataframe with 10 cats, 10 dogs etc.
Any help would be greatly appreciated.
I have tried to sort by unique values but it did not work. I am not very familiar with Pandas yet and this is the first time I am using it.
Edit:
Adding a better example of what I need
Animal | value |
---|---|
dog | 12 |
cat | 14 |
dog | 10 |
cat | 40 |
dog | 90 |
dog | 80 |
cat | 30 |
dog | 20 |
cat | 20 |
cat | 23 |
I want to separate this into 2 data frames. In this example the first dataframe would have 3 of each animal and the other one would have 2 of each animal.
Animal | value |
---|---|
dog | 12 |
dog | 10 |
dog | 90 |
cat | 14 |
cat | 40 |
cat | 30 |
Animal | value |
---|---|
dog | 80 |
dog | 20 |
cat | 20 |
cat | 23 |
CodePudding user response:
Does this work? df.groupby('animal', group_keys=False).apply(lambda x: x.sample(frac=0.2))
You could then remove these rows from your original dataframe to create the one with 40 of each animal.
CodePudding user response:
You can get the two dataframes the following way:
df_big = df.groupby('category').apply(lambda x: x.sample(frac=0.8)).reset_index('category', drop=True)
df_small = df.drop(df_big.index)
CodePudding user response:
We can construct a custom function that returns two DataFrames based on the requirement that one should have 40 entries of each animal and the other should contain 10, and then apply the function to the DataFrame, grouped by the 'Animal' column:
def split_df(df):
df1 = df[df['value'] < 30] # select rows where 'value' is less than 30
df2 = df[df['value'] >= 30] # select rows where 'value' is greater than or equal to 30
return df1, df2
# apply the custom function to the DataFrame, grouped by 'Animal'
df1, df2 = df.groupby('Animal').apply(split_df)
print(df1)
print(df2)
The two DataFrames that result will be printed to the console as a result. In this example, there will be four entries for each animal in the first DataFrame (df1), and the final 10 items for each animal will be in the second DataFrame (df2) (1 entry for each animal in this example).
CodePudding user response:
Pandas is really powerfull as you can see from @jmendes16 proposal.
What you not mentiin and should think about is, if you want 40 arbritary picked values or the first/last etc. Additionally is the final order important.
If you want to get familar with pandas you can try to do it step by step, by selecting parts of it and combine them. Eg. if you want to get the fourty first dogs, you can do:
df_40 = df[df.Animal == "dog"].iloc[0:40]
df_10 = df[df.Animal == "dog"].iloc[40:50]
Edit: That is not an efficient, but rather educational solution ;).