Home > Enterprise >  Split a dataframe according to priorities and specific number of rows
Split a dataframe according to priorities and specific number of rows

Time:05-23

I want to split a dataframe into 3 new dataframes according to a priority column. My dataframe is as follows:

      City         Priority
0   New York         3       
1   Paris            1
2   Boston           7
3   La Habana        6
4   Bilbao           10
5   Roma             2
6   Barcelona        1
7   Bruselas         8
8   Tokyo            7
9   Caracas          11

There are 3 types of priorities:

  • Priority 7 to 9
  • Priority 1 to 6
  • Priority from 10 to 11

The idea is to divide this dataframe in 3 with the following structure and that in turn would be ordered by the value of its priority:

  • Dataframe with 3 rows of priority from 7 to 9
  • Dataframe with 5 rows of priority from 1 to 6
  • Dataframe with 2 rows of the priority from 10 to 11.

The result would be as follows:

Dataframe 1:

      City         Priority
0   Boston           7       
1   Tokyo            7
2   Bruselas         8

Dataframe 2:

      City         Priority
0   Paris            1       
1   Barcelona        1
2   Roma             2
3   New York         3
4   La Habana        6

Dataframe 3:

      City         Priority
0   Bilbao           10       
1   Caracas          11

I think it is important to note that if there were no rows of priority 7 to 9, the priority numbers that would be chosen for that dataframe of 3 would be 10, if not 11, if not 1, if not 2, etc. The same with the rest of the dataframes and priorities: 1, 2, 3, 4, etc for the second one and 10, 11, 1, 2, 3, etc for the third one.

Also, if there were 4 values such as 7, 7, 7, 8, only rows 7, 7, 7 would appear in the 3-row Dataframe and the row with value 8 would be in Dataframe 2.

Likewise, I think it is also important to say that in that iteration, when the first dataframe of 3 rows is generated, they should be "extracted" from the original dataframe so that they are not taken into account when generating the other dataframes. I hope I have explained myself well and that someone can help me. Best regards and thanks!

CodePudding user response:

IIUC this should work as expected:

(1) you create a column bin_Priority which applies each row to the right bin, the labels of the bins are the priority in which order to look for them.

(2) sort_values on bin_Priority, then in each bin on Priority.

(3) split your df into 3 df's, the 1st with 3 rows, the 2nd with 2 rows and the 3rd with 5 rows. If values of Priority groups are missing it chooses the right values because they are ordered right.

Please let me know if that is what you are searching for.

df = pd.DataFrame({
    'City': ['New York','Paris','Boston','La Habana','Bilbao','Roma','Barcelona','Bruselas','Tokyo','Caracas'],
    'Priority': [3, 1, 7, 6, 10, 2, 1, 8, 7, 11]
})

#(1)
df['bin_Priority'] = pd.cut(df['Priority'], bins=[0,6,9,11], labels=[3, 1, 2]).to_numpy()

#(2)
ordered_priority_df = df.sort_values(by=['bin_Priority', 'Priority'])

#(3)
out = np.split(ordered_priority_df, [3,5])

print(df, ordered_priority_df, *out, sep='\n\n')

#df
        City  Priority  bin_Priority
0   New York         3             3
1      Paris         1             3
2     Boston         7             1
3  La Habana         6             3
4     Bilbao        10             2
5       Roma         2             3
6  Barcelona         1             3
7   Bruselas         8             1
8      Tokyo         7             1
9    Caracas        11             2

#ordered_priority_df
        City  Priority  bin_Priority
2     Boston         7             1
8      Tokyo         7             1
7   Bruselas         8             1
4     Bilbao        10             2
9    Caracas        11             2
1      Paris         1             3
6  Barcelona         1             3
5       Roma         2             3
0   New York         3             3
3  La Habana         6             3

# out[0]
       City  Priority  bin_Priority
2    Boston         7             1
8     Tokyo         7             1
7  Bruselas         8             1

# out[1]
      City  Priority  bin_Priority
4   Bilbao        10             2
9  Caracas        11             2

# out[2]
        City  Priority  bin_Priority
1      Paris         1             3
6  Barcelona         1             3
5       Roma         2             3
0   New York         3             3
3  La Habana         6             3

Here is an example where I changed the value of Paris from 1 to 7. value 8 (which should be in the 1st df) ends in the 2nd df and same with value 11 (from 2nd to 3rd).

df = pd.DataFrame({
    'City': ['New York','Paris','Boston','La Habana','Bilbao','Roma','Barcelona','Bruselas','Tokyo','Caracas'],
    'Priority': [3, 7, 7, 6, 10, 2, 1, 8, 7, 11]
})

df['bin_Priority'] = pd.cut(df['Priority'], bins=[0,6,9,11], labels=[3, 1, 2]).to_numpy()

ordered_priority_df = df.sort_values(by=['bin_Priority', 'Priority'])

out = np.split(ordered_priority_df, [3,5])

print(df, *out, sep='\n\n')

        City  Priority  bin_Priority
0   New York         3             3
1      Paris         7             1
2     Boston         7             1
3  La Habana         6             3
4     Bilbao        10             2
5       Roma         2             3
6  Barcelona         1             3
7   Bruselas         8             1
8      Tokyo         7             1
9    Caracas        11             2

     City  Priority  bin_Priority
1   Paris         7             1
2  Boston         7             1
8   Tokyo         7             1

       City  Priority  bin_Priority
7  Bruselas         8             1
4    Bilbao        10             2

        City  Priority  bin_Priority
9    Caracas        11             2
6  Barcelona         1             3
5       Roma         2             3
0   New York         3             3
3  La Habana         6             3
  • Related