How can I get values one day later (one row below) depending on the largest values in a row of anoth...

I have two data frames with the same date index and column names. For each row of the first one, I want to find the n largest values, then go to the other dataframe and look up the values of the same columns one day later (one row below).

This is about financial data. I want to test the forward performance of the n best stocks as ranked by an indicator value. So I first find the n largest indicator values each day (first dataframe), then try to get the performance data (second dataframe) of those stocks one day later, i.e. one row below. Example:

import pandas as pd
df = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    'a':[24, 11.2, 10.3, 16], 
    'b':[10.5, 12, 11.4, 13], 
    'c':[11, 11.6, 13, 15], 
    'd':[17, 10.5, 13, 15], 
    'e':[22, 21, 14.5, 13],
    'f':[12.7, 10, 10, 10],
    'g':[14, 13, 13, 12],
    'h':[13.8, 19, 13, 12],
    'i':[11, 14, 10.4, 16.3],
    'j':[10, 10, 19, 15.6]}) 
df = df.set_index('Date')

df1 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    'a':[0, 1.2, 0.3, 6], 
    'b':[0.5, 2, 1.4, 3], 
    'c':[2, 1.6, 3, 5], 
    'd':[7, 0.5, 2.4, 5], 
    'e':[4, 2, 4.5, 1],
    'f':[2.7, 0, 0, 0],
    'g':[4, 3, 3, 2],
    'h':[3.8, 1, 9, 2],
    'i':[1, 4, 0.4, 6.3],
    'j':[0, 0, 9, 5.6]}) 
df1 = df1.set_index('Date')

In the next step I would filter, for example, the 3 largest values in each row:

df_test = df.apply(pd.Series.nlargest, axis=1, n=3)

Or I filter for the column names of the 3 largest stocks each day:

df_large = df.apply(lambda s: s.abs().nlargest(3).index.tolist(), axis=1)
df_large = df_large.to_frame()
df_large[['Top1', 'Top2', 'Top3']] = pd.DataFrame(df_large[0].values.tolist(), index=df_large.index)
df_large = df_large.drop([0], axis=1)

But I don't know if I'm on the right track. I simply don't know how to connect the new dataframe with the one I want to take the next day's performance data from. Maybe there is an easier way (I'm sure there is).

At the end I want to have a result like this:

CodePudding user response：

Reshape your data to have a Stock index level and only one column. Then:

import pandas as pd

dates = pd.date_range(start='2021-01-01', end='2021-01-04', freq='D')


def sanitise(df: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame({'Value': df.stack()})
    df.index.names = 'Date', 'Stock'
    return df


df0 = sanitise(pd.DataFrame(
    {
        'a': (24.0, 11.2, 10.3, 16.0),
        'b': (10.5, 12.0, 11.4, 13.0),
        'c': (11.0, 11.6, 13.0, 15.0),
        'd': (17.0, 10.5, 13.0, 15.0),
        'e': (22.0, 21.0, 14.5, 13.0),
        'f': (12.7, 10.0, 10.0, 10.0),
        'g': (14.0, 13.0, 13.0, 12.0),
        'h': (13.8, 19.0, 13.0, 12.0),
        'i': (11.0, 14.0, 10.4, 16.3),
        'j': (10.0, 10.0, 19.0, 15.6),
    }, index=dates,
))

df1 = sanitise(pd.DataFrame(
    {
        'a': (0.0, 1.2, 0.3, 6.0),
        'b': (0.5, 2.0, 1.4, 3.0),
        'c': (2.0, 1.6, 3.0, 5.0),
        'd': (7.0, 0.5, 2.4, 5.0),
        'e': (4.0, 2.0, 4.5, 1.0),
        'f': (2.7, 0.0, 0.0, 0.0),
        'g': (4.0, 3.0, 3.0, 2.0),
        'h': (3.8, 1.0, 9.0, 2.0),
        'i': (1.0, 4.0, 0.4, 6.3),
        'j': (0.0, 0.0, 9.0, 5.6),
    }, index=dates,
))


# Get the top 3 stock values in each date group
top = (
    df0
    .sort_values('Value', ascending=False)
    .groupby(level='Date')
    .head(3)
    .sort_index()  # optional, but helps with debugging
    .loc[dates[:-1]]
)

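# The values in 'top' don't matter, only the index. Relabel each date as the
# following date. set_levels() returns a new index; its inplace argument was
# removed in pandas 2.0, so assign the result back.
top.index = top.index.set_levels(dates[1:], level=0)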

# For each day, select from df1 the values of the three stocks that had
# the three top values in df0 on the previous day.
print(df1.loc[top.index])

                  Value
Date       Stock       
2021-01-02 a        1.2
           d        0.5
           e        2.0
2021-01-03 e        4.5
           h        9.0
           i        0.4
2021-01-04 d        5.0
           e        1.0
           j        5.6
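
A variation on the same idea, as a minimal sketch only: instead of relabelling the Date level by position, the shifted index can be built from an explicit date-to-next-date mapping, which may be easier to follow and to adapt. The names indicator, performance, to_long and next_date, the toy numbers, and n = 2 are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Toy data (assumed for illustration): 3 days, 3 stocks, no ties.
dates = pd.date_range("2021-01-01", periods=3, freq="D")
indicator = pd.DataFrame(
    {"a": [3.0, 1.0, 2.0], "b": [2.0, 3.0, 1.0], "c": [1.0, 2.0, 3.0]},
    index=dates,
)
performance = pd.DataFrame(
    {"a": [0.1, 0.2, 0.3], "b": [0.4, 0.5, 0.6], "c": [0.7, 0.8, 0.9]},
    index=dates,
)


def to_long(wide: pd.DataFrame) -> pd.DataFrame:
    """Stack a wide frame into one 'Value' column indexed by (Date, Stock)."""
    long = wide.stack().rename("Value").to_frame()
    long.index.names = ["Date", "Stock"]
    return long


n = 2  # number of top stocks to pick per day (hypothetical choice)

# Top n indicator rows per date; only the resulting index is used below.
top = (
    to_long(indicator)
    .sort_values("Value", ascending=False)
    .groupby(level="Date")
    .head(n)
    .sort_index()
)

# Map every date except the last to the date that follows it.
next_date = dict(zip(dates[:-1], dates[1:]))

# Move each (date, stock) pick to the following date; picks made on the
# last date have no following row and are dropped.
shifted = pd.MultiIndex.from_tuples(
    [(next_date[d], s) for d, s in top.index if d in next_date],
    names=["Date", "Stock"],
)

result = to_long(performance).loc[shifted]
print(result)
```

Because the mapping is built from whatever dates are in the index, it should also work when the dates are not consecutive calendar days, as long as they are sorted.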

CodePudding user response：

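Using the wide df and df1 from the question, we can apply a custom function to each row of df: it finds the row's 3 largest columns, looks them up in the same row of a shifted copy of df1 (so that each date holds the next day's values), and keeps only the values.

We can use result_type='expand' to shape the values into a DataFrame, and then shift the result down one row so that each value sits on the date it actually belongs to.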

func = lambda x: (df1.shift(-1)
                     .loc[x.name, x.abs().nlargest(3).index]
                     .values)

df2 = df.apply(func, axis=1, result_type='expand').shift()
print(df2)

Output:

              0    1    2
Date
2021-01-01  NaN  NaN  NaN
2021-01-02  1.2  2.0  0.5
2021-01-03  4.5  9.0  0.4
2021-01-04  5.6  1.0  5.0

Rename the columns with df2.columns = ['Top1', 'Top2', 'Top3'] if you'd like.
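
If keeping the stock names as columns is acceptable, the row-wise apply can be avoided. The following is only a sketch under assumptions: it ranks each row, turns the top n into a boolean mask, shifts the mask down one row, and uses it to mask the performance frame. The names indicator, performance, picked and forward, the toy numbers, and n = 2 are made up for illustration.

```python
import pandas as pd

# Toy data (assumed for illustration): 3 days, 3 stocks, no ties.
dates = pd.date_range("2021-01-01", periods=3, freq="D")
indicator = pd.DataFrame(
    {"a": [3.0, 1.0, 2.0], "b": [2.0, 3.0, 1.0], "c": [1.0, 2.0, 3.0]},
    index=dates,
)
performance = pd.DataFrame(
    {"a": [0.1, 0.2, 0.3], "b": [0.4, 0.5, 0.6], "c": [0.7, 0.8, 0.9]},
    index=dates,
)

n = 2  # number of top stocks to pick per day (hypothetical choice)

# Rank within each row, 1 = largest; method="first" breaks ties by
# column order so that exactly n columns are picked per row.
ranks = indicator.rank(axis=1, ascending=False, method="first")
picked = ranks <= n

# A pick made on day t applies to the performance on day t+1, so move
# the mask down one row; the first day has no previous picks.
picked_prev = picked.shift(1, fill_value=False)

# Keep the next-day performance of the picked stocks, NaN elsewhere.
forward = performance.where(picked_prev)
print(forward)
```

The result has the same shape as the inputs, with NaN wherever a stock was not among the previous day's top n, so a row-wise mean such as forward.mean(axis=1) would give the average next-day performance of each day's picks.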