Home > Back-end >  pandas multiIndex pair only for top and bottom n rows
pandas multiIndex pair only for top and bottom n rows

Time:05-27

I have a pandas dataframe like as shown below

Company,year                                   
T123 Inc Ltd,1990
T124 PVT ltd,1991
T345 Ltd,1990
T789 Pvt.LTd,2001
ABC Limited,1992
ABCDE Ltd,1994
ABC Ltd,1997
ABFE,1987
Tesla ltd,1995
AMAZON Inc,2001
Apple ltd,2003

tf = pd.read_clipboard(sep=',')
tf['Company_copy'] = tf['Company']

I would like to compare each value from tf['company'] against 5 rows above and 5 rows below from its current position.

For ex: I want T123 Inc Ltd to be compared with T124,T345,T789,ABC,ACDE. As T123 is the top row, there is nothing to compare above, we don't generate any comparison pair.

Similarly, if there is noT enough (5 rows) rows to compare, we compare it with whatever we have.

So, I tried the below with the help of this post here

pd.MultiIndex.from_product([tf['Company'].astype(str),tf['Company_copy'].astype(str)]).to_series()

but it produces m*n comparison across all rows of the dataframe

This is because, we have million records and cannot use multiindex_from product to generate all comparisons (which is not useful for us). Just the top and bottom 5 comparison pair for each row is enough. Is there anyway to apply filter multiindex to generate pair based on above and below 5 rows?

I expect my output to be like as below. I show for only one record T123 Inc Ltd.

Company       Company     
  
T123 Inc Ltd  T124 PVT ltd    (T123 Inc Ltd, T124 PVT ltd)
              T345 Ltd            (T123 Inc Ltd, T345 Ltd)
              T789 Pvt.LTd    (T123 Inc Ltd, T789 Pvt.LTd)
              ABC Limited      (T123 Inc Ltd, ABC Limited)
              ABCDE Ltd          (T123 Inc Ltd, ABCDE Ltd)

CodePudding user response:

A possible solution using Series.rolling with center=True and window size 11 (=5 1 5) on Company and then exclude the tuples with the middle row:

from itertools import chain, product

idx = pd.MultiIndex.from_tuples(chain(*(product([row], win)
                         for row, win in zip(tf['Company'], tf['Company'].rolling(11, min_periods=1, center=True))))).to_series()
        
idx = idx[idx.index.get_level_values(0) != idx.index.get_level_values(1)]

Example results:

print(idx['T123 Inc Ltd'])
T124 PVT ltd    (T123 Inc Ltd, T124 PVT ltd)
T345 Ltd            (T123 Inc Ltd, T345 Ltd)
T789 Pvt.LTd    (T123 Inc Ltd, T789 Pvt.LTd)
ABC Limited      (T123 Inc Ltd, ABC Limited)
ABCDE Ltd          (T123 Inc Ltd, ABCDE Ltd)
dtype: object
print(idx['ABCDE Ltd'])
T123 Inc Ltd    (ABCDE Ltd, T123 Inc Ltd)
T124 PVT ltd    (ABCDE Ltd, T124 PVT ltd)
T345 Ltd            (ABCDE Ltd, T345 Ltd)
T789 Pvt.LTd    (ABCDE Ltd, T789 Pvt.LTd)
ABC Limited      (ABCDE Ltd, ABC Limited)
ABC Ltd              (ABCDE Ltd, ABC Ltd)
ABFE                    (ABCDE Ltd, ABFE)
Tesla ltd          (ABCDE Ltd, Tesla ltd)
AMAZON Inc        (ABCDE Ltd, AMAZON Inc)
Apple ltd          (ABCDE Ltd, Apple ltd)
dtype: object
print(idx['Apple ltd'])
ABCDE Ltd      (Apple ltd, ABCDE Ltd)
ABC Ltd          (Apple ltd, ABC Ltd)
ABFE                (Apple ltd, ABFE)
Tesla ltd      (Apple ltd, Tesla ltd)
AMAZON Inc    (Apple ltd, AMAZON Inc)
dtype: object
  • Related