Home > Blockchain >  Pandas conditional join and calculation
Pandas conditional join and calculation

Time:01-25

I have two Pandas dataframes, df_stock_prices and df_sentiment_mean.

I would like to do the following:

  1. Left join/merge these two dataframes into one dataframe, joined by Date and by ticker. In df_stock_prices, ticker is the column name, for example AAPL.OQ and in df_sentiment_mean ticker is found within the rows of the column named ticker.

  2. If there is a Date and ticker from df_stock_prices that doesn't match df_sentiment_mean, keep the non-matching row of df_stock_prices as-is (hence the left join).

  3. When there is a match for both Date and ticker, multiply the fields together; for example in the dataframes listed below, if df_stock_prices Date is 2021-11-29 and column AAPL.OQ is a match for the df_sentiment_mean Date of 2021-11-29 and ticker AAPL.OQ, then multiply the values for the match, in this example: 160.24 * 0.163266.

If a Date and ticker from df_stock_prices doesn't match a Date and ticker value from df_sentiment_mean, keep the values from df_stock_prices.

Current dataframes:

df_stock_prices:


            AAPL.OQ  ABBV.N   ABT.N   ACN.N  ADBE.OQ  AIG.N  AMD.OQ  AMGN.OQ  
Date                                                                           
2021-11-29   160.24  116.89  128.03  365.82   687.49  54.95  161.91   203.47   
2021-11-30   165.30  115.28  125.77  357.40   669.85  52.60  158.37   198.88   
2021-12-01   164.77  115.91  126.74  360.14   657.41  51.72  149.11   200.80   
2021-12-02   163.76  116.87  128.38  365.30   671.88  53.96  150.68   201.17   
2021-12-03   161.84  118.85  130.27  361.42   616.53  53.32  144.01   202.44   
...

df_sentiment_mean:

            ticker      diff
Date                         
2021-11-29  AAPL.OQ  0.163266
2021-11-29   ABBV.N -0.165520
2021-11-29    ABT.N  0.149920
2021-11-29  ADBE.OQ -0.014639
2021-11-29    AIG.N -0.448595
...             ...       ...
2023-01-12    LOW.N  0.008863
2023-01-12    MDT.N  0.498884
2023-01-12     MO.N -0.013428
2023-01-12    NEE.N  0.255223
2023-01-12    NKE.N  0.072752

Desired dataframe, partial first row example:

df_new:

            AAPL.OQ         ABBV.N      ABT.N        ACN.N     ADBE.OQ       AIG.N  …
Date
2021-11-29  26.16174384    -19.3476328  19.1942576  365.82    -100.6416611  -0.80441305  …
...

CodePudding user response:

You have to reshape your second dataframe then multiply both dataframes. Finally, fill nan by original values of your first dataframe:

# df1 = df_stock_prices
# df2 = df_sentiment_mean
df_new = df1.mul(df2.set_index('ticker', append=True)['diff'].unstack('ticker')).fillna(df1)
>>> df1
            AAPL.OQ  ABBV.N   ABT.N   ACN.N  ADBE.OQ  AIG.N
Date                                                       
2021-11-29   160.24  116.89  128.03  365.82   687.49  54.95


>>> df2
             ticker      diff
Date                         
2021-11-29  AAPL.OQ  0.163266
2021-11-29   ABBV.N -0.165520
2021-11-29    ABT.N  0.149920
2021-11-29  ADBE.OQ -0.014639
2021-11-29    AIG.N -0.448595


>>> df_new
              AAPL.OQ     ABBV.N      ABT.N   ACN.N    ADBE.OQ      AIG.N
Date                                                                     
2021-11-29  26.161744 -19.347633  19.194258  365.82 -10.064166 -24.650295

CodePudding user response:

Here's another way to solve your problem:

# Necessary imports
import numpy as np
import pandas as pd

# Create a dataframe with the stock prices
df_stock_prices = pd.DataFrame(
    [
        ["2021-11-29", 160.24, 116.89, 128.03, 365.82, 687.49, 54.95, 161.91, 203.47],
        ["2021-11-30", 165.30, 115.28, 125.77, 357.40, 669.85, 52.60, 158.37, 198.88],
        ["2021-12-01", 164.77, 115.91, 126.74, 360.14, 657.41, 51.72, 149.11, 200.80],
        ["2021-12-02", 163.76, 116.87, 128.38, 365.30, 671.88, 53.96, 150.68, 201.17],
        ["2021-12-03", 161.84, 118.85, 130.27, 361.42, 616.53, 53.32, 144.01, 202.44],
    ],
    columns=["Date", "AAPL.OQ", "ABBV.N", "ABT.N", "ACN.N", "ADBE.OQ", "AIG.N",
             "AMD.OQ", "AMGN.OQ"],
)

# Create a dataframe with the sentiment mean
df_sentiment_mean = pd.DataFrame(
    [
        ["2021-11-29", "AAPL.OQ", 0.163266],
        ["2021-11-29", "ABBV.N", -0.165520],
        ["2021-11-29", "ABT.N", 0.149920],
        ["2021-11-29", "ADBE.OQ", -0.014639],
        ["2021-11-29", "AIG.N", -0.448595],
        ["2023-01-12", "LOW.N", 0.008863],
        ["2023-01-12", "MDT.N", 0.498884],
        ["2023-01-12", "MO.N", -0.013428],
        ["2023-01-12", "NEE.N", 0.255223],
        ["2023-01-12", "NKE.N", 0.072752],
    ],
    columns=["Date", "ticker", "diff"],
)

# Melt the stock prices dataframe
df_stock_prices = df_stock_prices.melt(id_vars=['Date'], var_name='ticker', value_name='price')
# Output:
#
#           Date   ticker   price
# 0   2021-11-29  AAPL.OQ  160.24
# 1   2021-11-30  AAPL.OQ  165.30
# 2   2021-12-01  AAPL.OQ  164.77
# 3   2021-12-02  AAPL.OQ  163.76
# 4   2021-12-03  AAPL.OQ  161.84
# 5   2021-11-29   ABBV.N  116.89
# 6   2021-11-30   ABBV.N  115.28
# 7   2021-12-01   ABBV.N  115.91
# 8   2021-12-02   ABBV.N  116.87
# 9   2021-12-03   ABBV.N  118.85
# 10  2021-11-29    ABT.N  128.03
# ...        ...      ...     ...

# Merge the two dataframes
df_new = pd.merge(df_stock_prices, df_sentiment_mean, how='left', on=['Date', 'ticker'])
# Output:
#
#           Date   ticker   price      diff
# 0   2021-11-29  AAPL.OQ  160.24  0.163266
# 1   2021-11-30  AAPL.OQ  165.30       NaN
# 2   2021-12-01  AAPL.OQ  164.77       NaN
# 3   2021-12-02  AAPL.OQ  163.76       NaN
# 4   2021-12-03  AAPL.OQ  161.84       NaN
# 5   2021-11-29   ABBV.N  116.89 -0.165520
# 6   2021-11-30   ABBV.N  115.28       NaN
# 7   2021-12-01   ABBV.N  115.91       NaN
# 8   2021-12-02   ABBV.N  116.87       NaN
# 9   2021-12-03   ABBV.N  118.85       NaN
# 10  2021-11-29    ABT.N  128.03  0.149920
# ...        ...      ...     ...       ...

# Create a new column with the value of the stock price multiplied by the sentiment mean
df_new['value'] = np.where(df_new['diff'].isna(), df_new['price'], df_new['price'] * df_new['diff'])
# Output:
#
#           Date   ticker   price      diff       value
# 0   2021-11-29  AAPL.OQ  160.24  0.163266   26.161744
# 1   2021-11-30  AAPL.OQ  165.30       NaN  165.300000
# 2   2021-12-01  AAPL.OQ  164.77       NaN  164.770000
# 3   2021-12-02  AAPL.OQ  163.76       NaN  163.760000
# 4   2021-12-03  AAPL.OQ  161.84       NaN  161.840000
# 5   2021-11-29   ABBV.N  116.89 -0.165520  -19.347633
# ...        ...      ...     ...       ...         ...

# Pivot the dataframe
df_new = df_new.pivot(index='Date', columns='ticker', values='value')
# Output:
#
# ticker         AAPL.OQ      ABBV.N       ABT.N   ACN.N     ADBE.OQ      AIG.N  AMD.OQ  AMGN.OQ
# Date                                                                                          
# 2021-11-29   26.161744  -19.347633   19.194258  365.82  -10.064166 -24.650295  161.91   203.47
# 2021-11-30  165.300000  115.280000  125.770000  357.40  669.850000  52.600000  158.37   198.88
# 2021-12-01  164.770000  115.910000  126.740000  360.14  657.410000  51.720000  149.11   200.80
# 2021-12-02  163.760000  116.870000  128.380000  365.30  671.880000  53.960000  150.68   201.17
# 2021-12-03  161.840000  118.850000  130.270000  361.42  616.530000  53.320000  144.01   202.44
  • Related