I have a single dataframe containing info on users from each country in the world:
id first_name last_name gender is_valid country
1 Dianemarie Stormonth Female false Indonesia
2 Joel Probbin Male false Russia
3 Corrie Bage Female true Paraguay
4 Kin Boothman Male true Western Sahara
5 Aylmer McMeanma Male true Ukraine
6 Megan Leppard Female false Poland
7 Sandye Hugle Female true Indonesia
8 Jayson Millier Male false Greece
9 Janot Adicot Female false Brazil
and I have a need to do the following; for each country, get a dataframe of the users in that country, and then split this filtered dataframe into 2 separate ones , with a 50/50 random split.
I think I have gotten the logic down for the filtering and split in a for loop, but am unsure how to keep the outcome of each iteration rather than the just last country. Can anybody help?
my code so far:
for country in df.country.unique():
df_filtered = df.query(f"country == "{country}")
df1 = df_filtered.sample(frac=0.5, random_state=1)
df2 = df_filtered.drop(df1.index)
As you can see , I first make a filtered version of my df, then use the pandas sample
function to get 50% of the data randomly (along with random state for reproducability) , and then get the other 50% of the data using the original filtered df and df1.
The ideal scenario would be to have a bunch of dataframes , 2 per country, e.g. UK_df1, UK_df2, etc....
Would the right way to keep the outcome of my loop be to have an empty list for each country name, and then assign df1 and df2 to that empty list on each iteration? such as
uk_dataframes = [UK_df1, UK_df2]
Thanks !!
CodePudding user response:
It all comes down to what you want the output to be, here I provide three options, either save the files in a .csv
format, save them in their respective lists, or use dictionaries:
split_1_list = []
split_2_list = []
split_1_dict = {}
split_2_dict = {}
for country in df.country.unique():
df_filtered = df[df.country == country]
df_split_1 = frac=0.5, random_state=1)
df_split_2 = df_filtered.drop(df1.index)
# if you want csv:
df_split_1.to_csv(f"{country}_split1.csv")
df_split_2.to_csv(f"{country}_split2.csv")
# if you want lists:
split_1_list.append(df_split_1)
split_2_list.append(df_split_2)
# if you want dictionaries
split_1_dict[country] = df_split_1
split_2_dict[country] = df_split_2
If you want to keep them in variables, then dictionaries might be the most organised way to do this.
CodePudding user response:
Your question asks:
The ideal scenario would be to have a bunch of dataframes , 2 per country, e.g. UK_df1, UK_df2, etc....
Would the right way to keep the outcome of my loop be to have an empty list for each country name, and then assign df1 and df2 to that empty list on each iteration?
Because your output is essentially a collection of pairs of dataframes, one straightforward possibility is to store a list of pairs. This can be done as a list of lists, where the inner lists each have length 2:
dfCountrySplits = []
print(df.country.unique())
for country in df.country.unique():
df_filtered = df.query(f"country == '{country}'")
df1 = df_filtered.sample(frac=0.5, random_state=1)
df2 = df_filtered.drop(df1.index)
dfCountrySplits.append([df1.reset_index(drop=True), df2.reset_index(drop=True)])
[print('', x[0], x[1], sep='\n') for x in dfCountrySplits]
Input:
id first_name last_name gender is_valid country
0 1 Dianemarie Stormonth Female false Indonesia
1 2 Joel Probbin Male false Russia
2 3 Corrie Bage Female true Paraguay
3 4 Kin Boothman Male true Western_Sahara
4 5 Aylmer McMeanma Male true Ukraine
5 6 Megan Leppard Female false Poland
6 7 Sandye Hugle Female true Indonesia
7 8 Jayson Millier Male false Greece
8 9 Janot Adicot Female false Brazil
9 10 Alice Smith Female true Indonesia
10 11 Bob Jones Male false Russia
11 12 Alice Smith2 Female true Paraguay
12 13 Bob Jones2 Male false Western_Sahara
13 14 Alice Smith3 Female true Ukraine
14 15 Bob Jones3 Male false Poland
15 16 Alice Smith4 Female true Indonesia
16 17 Bob Jones4 Male false Greece
17 18 Alice Smith5 Female true Brazil
Output:
id first_name last_name gender is_valid country
0 16 Alice Smith4 Female true Indonesia
1 10 Alice Smith Female true Indonesia
id first_name last_name gender is_valid country
0 1 Dianemarie Stormonth Female false Indonesia
1 7 Sandye Hugle Female true Indonesia
id first_name last_name gender is_valid country
0 2 Joel Probbin Male false Russia
id first_name last_name gender is_valid country
0 11 Bob Jones Male false Russia
id first_name last_name gender is_valid country
0 3 Corrie Bage Female true Paraguay
id first_name last_name gender is_valid country
0 12 Alice Smith2 Female true Paraguay
id first_name last_name gender is_valid country
0 4 Kin Boothman Male true Western_Sahara
id first_name last_name gender is_valid country
0 13 Bob Jones2 Male false Western_Sahara
id first_name last_name gender is_valid country
0 5 Aylmer McMeanma Male true Ukraine
id first_name last_name gender is_valid country
0 14 Alice Smith3 Female true Ukraine
id first_name last_name gender is_valid country
0 6 Megan Leppard Female false Poland
id first_name last_name gender is_valid country
0 15 Bob Jones3 Male false Poland
id first_name last_name gender is_valid country
0 8 Jayson Millier Male false Greece
id first_name last_name gender is_valid country
0 17 Bob Jones4 Male false Greece
id first_name last_name gender is_valid country
0 9 Janot Adicot Female false Brazil
id first_name last_name gender is_valid country
0 18 Alice Smith5 Female true Brazil
Note that I expanded the test input so that each 50% split of rows by country had at least 1 element, just so the test results would not be dominated by pairs with one empty dataframe.
CodePudding user response:
If you want a convenient obejct you could use a dictionary to store the dataframes with the country names as keys. Someting like this
country_dataframes = dict()
for country in df.country.unique():
df_filtered = df[df.country == country]
df1 = df_filtered.sample(frac=0.5, random_state=1)
df2 = df_filtered.drop(df1.index)
country_dataframes[country] = [df1, df2]
print(country_dataframes["Brazil"][0]) # gets you the df1
print(country_dataframes["Brazil"][1]) # gets you the df2
two other options would be:
- save the datframes to a csv and read it later when you need it (use
df1.to_csv(f'yourFilename"{country}".csv'
for this) - save the dataframes in a two dimensional list (
[[df1_brazil, df2_brazil],[df1_germany, df2_germany]...]
)