Home > database >  How to filter, split and save multiple dataframes in a single for loop?
How to filter, split and save multiple dataframes in a single for loop?

Time:06-09

I have a single dataframe containing info on users from each country in the world:

id  first_name  last_name   gender  is_valid    country
1   Dianemarie  Stormonth   Female  false   Indonesia
2   Joel    Probbin Male    false   Russia
3   Corrie  Bage    Female  true    Paraguay
4   Kin Boothman    Male    true    Western Sahara
5   Aylmer  McMeanma    Male    true    Ukraine
6   Megan   Leppard Female  false   Poland
7   Sandye  Hugle   Female  true    Indonesia
8   Jayson  Millier Male    false   Greece
9   Janot   Adicot  Female  false   Brazil

and I have a need to do the following; for each country, get a dataframe of the users in that country, and then split this filtered dataframe into 2 separate ones , with a 50/50 random split.

I think I have gotten the logic down for the filtering and split in a for loop, but am unsure how to keep the outcome of each iteration rather than the just last country. Can anybody help?

my code so far:

for country in df.country.unique():
    df_filtered = df.query(f"country == "{country}")
    df1 = df_filtered.sample(frac=0.5, random_state=1)
    df2 = df_filtered.drop(df1.index)

As you can see , I first make a filtered version of my df, then use the pandas sample function to get 50% of the data randomly (along with random state for reproducability) , and then get the other 50% of the data using the original filtered df and df1.

The ideal scenario would be to have a bunch of dataframes , 2 per country, e.g. UK_df1, UK_df2, etc....

Would the right way to keep the outcome of my loop be to have an empty list for each country name, and then assign df1 and df2 to that empty list on each iteration? such as

uk_dataframes = [UK_df1, UK_df2] 

Thanks !!

CodePudding user response:

It all comes down to what you want the output to be, here I provide three options, either save the files in a .csv format, save them in their respective lists, or use dictionaries:

split_1_list = []
split_2_list = []
split_1_dict = {}
split_2_dict = {}
for country in df.country.unique():
    df_filtered = df[df.country == country]
    df_split_1 = frac=0.5, random_state=1)
    df_split_2 = df_filtered.drop(df1.index)
    # if you want csv:
    df_split_1.to_csv(f"{country}_split1.csv")
    df_split_2.to_csv(f"{country}_split2.csv")
    # if you want lists:
    split_1_list.append(df_split_1)
    split_2_list.append(df_split_2)
    # if you want dictionaries
    split_1_dict[country] = df_split_1
    split_2_dict[country] = df_split_2

If you want to keep them in variables, then dictionaries might be the most organised way to do this.

CodePudding user response:

Your question asks:

The ideal scenario would be to have a bunch of dataframes , 2 per country, e.g. UK_df1, UK_df2, etc....

Would the right way to keep the outcome of my loop be to have an empty list for each country name, and then assign df1 and df2 to that empty list on each iteration?

Because your output is essentially a collection of pairs of dataframes, one straightforward possibility is to store a list of pairs. This can be done as a list of lists, where the inner lists each have length 2:

dfCountrySplits = []
print(df.country.unique())
for country in df.country.unique():
    df_filtered = df.query(f"country == '{country}'")
    df1 = df_filtered.sample(frac=0.5, random_state=1)
    df2 = df_filtered.drop(df1.index)
    dfCountrySplits.append([df1.reset_index(drop=True), df2.reset_index(drop=True)])

[print('', x[0], x[1], sep='\n') for x in dfCountrySplits]

Input:

    id  first_name  last_name  gender is_valid         country
0    1  Dianemarie  Stormonth  Female    false       Indonesia
1    2        Joel    Probbin    Male    false          Russia
2    3      Corrie       Bage  Female     true        Paraguay
3    4         Kin   Boothman    Male     true  Western_Sahara
4    5      Aylmer   McMeanma    Male     true         Ukraine
5    6       Megan    Leppard  Female    false          Poland
6    7      Sandye      Hugle  Female     true       Indonesia
7    8      Jayson    Millier    Male    false          Greece
8    9       Janot     Adicot  Female    false          Brazil
9   10       Alice      Smith  Female     true       Indonesia
10  11         Bob      Jones    Male    false          Russia
11  12       Alice     Smith2  Female     true        Paraguay
12  13         Bob     Jones2    Male    false  Western_Sahara
13  14       Alice     Smith3  Female     true         Ukraine
14  15         Bob     Jones3    Male    false          Poland
15  16       Alice     Smith4  Female     true       Indonesia
16  17         Bob     Jones4    Male    false          Greece
17  18       Alice     Smith5  Female     true          Brazil

Output:

   id first_name last_name  gender is_valid    country
0  16      Alice    Smith4  Female     true  Indonesia
1  10      Alice     Smith  Female     true  Indonesia
  id  first_name  last_name  gender is_valid    country
0  1  Dianemarie  Stormonth  Female    false  Indonesia
1  7      Sandye      Hugle  Female     true  Indonesia

  id first_name last_name gender is_valid country
0  2       Joel   Probbin   Male    false  Russia
   id first_name last_name gender is_valid country
0  11        Bob     Jones   Male    false  Russia

  id first_name last_name  gender is_valid   country
0  3     Corrie      Bage  Female     true  Paraguay
   id first_name last_name  gender is_valid   country
0  12      Alice    Smith2  Female     true  Paraguay

  id first_name last_name gender is_valid         country
0  4        Kin  Boothman   Male     true  Western_Sahara
   id first_name last_name gender is_valid         country
0  13        Bob    Jones2   Male    false  Western_Sahara

  id first_name last_name gender is_valid  country
0  5     Aylmer  McMeanma   Male     true  Ukraine
   id first_name last_name  gender is_valid  country
0  14      Alice    Smith3  Female     true  Ukraine

  id first_name last_name  gender is_valid country
0  6      Megan   Leppard  Female    false  Poland
   id first_name last_name gender is_valid country
0  15        Bob    Jones3   Male    false  Poland

  id first_name last_name gender is_valid country
0  8     Jayson   Millier   Male    false  Greece
   id first_name last_name gender is_valid country
0  17        Bob    Jones4   Male    false  Greece

  id first_name last_name  gender is_valid country
0  9      Janot    Adicot  Female    false  Brazil
   id first_name last_name  gender is_valid country
0  18      Alice    Smith5  Female     true  Brazil

Note that I expanded the test input so that each 50% split of rows by country had at least 1 element, just so the test results would not be dominated by pairs with one empty dataframe.

CodePudding user response:

If you want a convenient obejct you could use a dictionary to store the dataframes with the country names as keys. Someting like this

country_dataframes = dict()

for country in df.country.unique():
    df_filtered = df[df.country == country] 
    df1 = df_filtered.sample(frac=0.5, random_state=1)
    df2 = df_filtered.drop(df1.index)
    country_dataframes[country] = [df1, df2]

print(country_dataframes["Brazil"][0]) # gets you the df1
print(country_dataframes["Brazil"][1]) # gets you the df2

two other options would be:

  • save the datframes to a csv and read it later when you need it (use df1.to_csv(f'yourFilename"{country}".csv' for this)
  • save the dataframes in a two dimensional list ([[df1_brazil, df2_brazil],[df1_germany, df2_germany]...])
  • Related