Home > OS >  Pandas: Reorder a data frame based on conditions
Pandas: Reorder a data frame based on conditions

Time:11-22

I would like to reorder my dataframe based on certain conditions

My original dataframe looks like

Merchant name         original_rnk
Boohoo                1
PRETTYLITTLETHING     2
ASOS US               3
PRINCESS POLLY        4
URBAN OUTFITTERS      5
KIM ONO               6

And there is a reference dataframe that has some merchant information

Merchant name         order_cnt     profit     epc      
Boohoo                200           30         0.6
PRETTYLITTLETHING     100           -60        -0.4
ASOS US               50            100        1.0
PRINCESS POLLY        80            50         0.8
URBAN OUTFITTERS      120           -20        -0.1
KIM ONO               500           90         0.7

I would like to give a new rank to these merchant based on their epc if their order_cnt >= 100 and profit >=0. The first merchant will always be the first no matter how much its order_cnt and profit are, but for the rest whose order_cnt <100 or profit <0, r their original order.

So my desired output is

Merchant name         new_rnk     original_rnk
Boohoo                1           1
PRETTYLITTLETHING     3           2
ASOS US               4           3
PRINCESS POLLY        5           4
URBAN OUTFITTERS      6           5
KIM ONO               2           6

CodePudding user response:

Using the data provided in the question:

info = pd.DataFrame({
    'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM ONO'],
    'order_cnt': [200, 100, 50, 80, 120, 500],
    'profit': [30, -60, 100, 50, -20, 90],
    'epc': [0.6, -0.4, 1.0, 0.8, -0.1, 0.7]
})

Split the data into the first row (head), the rows that satisfy the condition (mask) and the rows that don't (pos and neg):

head = info.head(1)
tail = info.iloc[1:]
mask = tail.eval('order_cnt >= 100 and profit >= 0')
pos = tail[mask]
neg = tail[~mask]

Sort the positive rows using the desired criteria (epc) and concatenate the three partitions back together:

df = pd.concat([head, pos.sort_values('epc', ascending=False), neg])

To get output as presented in the original question (with both the original and the new ranks and sorted by the original rank) add these lines:

df['new_rank'] = range(1, 7)
df['original_rank'] = df['Merchant name'].map(ranks.set_index('Merchant name')['original_rnk'])
df.sort_values('original_rank')[['Merchant name', 'new_rank', 'original_rank']]

where ranks is the "original data frame":

ranks = pd.DataFrame({
    'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM ONO'],
    'original_rnk': range(1, 7)
})

CodePudding user response:

This can be done as follows:

  1. Initializing your first DataFrame with the name: df
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Merchant name' : ['Boohoo','PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM ONO', 'addedMerch'], 
    'original_rnk': [1, 2, 3, 4, 5, 6, 7] 
})

-note I have added data of an additional merchant named: 'addedMerch' for making sure later the code works

  1. Initializing your second "reference" DataFrame with the name: df2
df2 = pd.DataFrame({
    'Merchant name' : ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM ONO','addedMerch'], 
    'order_cnt': [200, 100, 50, 80, 120, 500, 1000],
    'profit': [30, -60, 100, 50, -20, 90, 1000],
    'epc': [0.6 ,-0.4, 1.0, 0.8, -0.1, 0.7,0.01]
})
  1. Renamed column label from 'Merchant name' to 'merchant_name' in both DataFrames
df.rename(columns={'Merchant name': 'merchant_name'}, inplace=True)
df2.rename(columns={'Merchant name': 'merchant_name'}, inplace=True)
  1. Removing the first merchant temporarily from both DataFrames and storing his data to append it to the the DataFrames again after the operation
#storing it first
firstMerchant = df.query("original_rnk == 1")
firstMerchantInDF2 = df2.query("merchant_name == '{}'".format(firstMerchant.iloc[0]['merchant_name']))
#then removing it
df.drop(firstMerchant.index, axis="rows", inplace=True)
df2.drop(firstMerchantInDF2.index, axis="rows", inplace=True)
  1. Creating a new DataFrame for merchants that will have a newly assigned rank
merchantsWithNewRank = df2.query("order_cnt >= 100 & profit >= 0")
  1. Giving them their new rank according to their 'epc' value
#this is done by first, sorting them by 'epc'
merchantsWithNewRank.sort_values(by="epc", ascending=False)
# and then second, adding a new column: 'new_rnk', assigning it to a range
# starting from 2 (because 1 is for the 'untouched first merchant')
# all the way to (the amount of all merchants with new rank   2)
# -to make sure we are giving exactly each merchant his rank-
merchantsWithNewRank['new_rnk'] = np.arange(2,merchantsWithNewRank.shape[0] 2)
  1. Initialize the column 'new_rnk' in first DataFrame, holding original ranks but pushed n places (each increased by n), where n is the amount of merchants with new rank
df["new_rnk"] = df.original_rnk   merchantsWithNewRank.shape[0]
  1. Assign new ranks for merchants that were qualified to hold a new rank in the original DataFrame (using DF: 'merchantsWithNewRank' as our reference)
for index in merchantsWithNewRank.index:
    df.loc[df.merchant_name==merchantsWithNewRank['merchant_name'][index], 'new_rnk'] = merchantsWithNewRank['new_rnk'][index]
  1. Bring back the first merchant's data and give him his new rank of 1
df = firstMerchant.append(df)
df.loc[df.original_rnk == 1, 'new_rnk'] = 1
  1. Bring him back to df2 too
df2 = firstMerchantInDF2.append(df2)

The result for the above:

-row "addedMerch" is added for double checking the results are correct

[in]: print(df)
[out]:
       merchant_name  original_rnk  new_rnk
0             Boohoo             1      1.0
1  PRETTYLITTLETHING             2      4.0
2            ASOS US             3      5.0
3     PRINCESS POLLY             4      6.0
4   URBAN OUTFITTERS             5      7.0
5            KIM ONO             6      2.0
6         addedMerch             7      3.0

Output for input you provided (without 'addedMerch'):

[in]: print(df)
[out]:
       merchant_name  original_rnk  new_rnk
0             Boohoo             1      1.0
1  PRETTYLITTLETHING             2      3.0
2            ASOS US             3      4.0
3     PRINCESS POLLY             4      5.0
4   URBAN OUTFITTERS             5      6.0
5            KIM ONO             6      2.0

CodePudding user response:

You can use the following code for the desired output;

import pandas as pd

original_rank_frame = pd.DataFrame({'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM ONO'],
                                    'original_rnk': [1, 2, 3, 4, 5 ,6]})

reference_frame = pd.DataFrame({'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM ONO'],
                                'order_cnt': [200, 100, 50, 80, 120, 500],
                                'profit': [30, -60, 100, 50, -20, 90],
                                'epc': [0.6, -0.4, 1.0, 0.8, -0.1, 0.7]})

final_table = pd.concat([reference_frame[((reference_frame['order_cnt'] >= 100) & (reference_frame['epc'] >= 0))], 
                         reference_frame[~((reference_frame['order_cnt'] >= 100) & (reference_frame['epc'] >= 0))]], axis=0)
final_table = final_table.reset_index().rename({'index':'original_rnk'}, axis = 'columns').reset_index().rename({'index':'new_rnk'}, axis = 'columns')[['Merchant name', 'new_rnk', 'original_rnk']]
final_table[['new_rnk', 'original_rnk']]  = 1
final_table.sort_values('original_rnk')

Output

       Merchant name  new_rnk  original_rnk
0             Boohoo        1             1
2  PRETTYLITTLETHING        3             2
3            ASOS US        4             3
4     PRINCESS POLLY        5             4
5   URBAN OUTFITTERS        6             5
1            KIM ONO        2             6

Explanation

The first step is to filter the dataframe by desired qualities reference_frame[((reference_frame['order_cnt'] >= 100) & (reference_frame['epc'] >= 0)). Since these are mutually exclusive sets, we can use the negation (~) to get the rest.Then, we concat these two dataframes and extract the original index. We assign a new index by resetting. In the last step, we increment the index values since they start with 0 but the desired output from 1.

  • Related