I have a pandas DataFrame containing symbols and times at which a transaction took place. There’s also a third column indicating whether the transaction was a purchase or a sale. For example,
pd.DataFrame({"col1": ["AAPL", "AAPL", "GOOG", "AAPL", "GOOG"], "col2": [1, 2, 3, 6, 7], "col3": ["BUY", "SELL", "BUY", "BUY", "BUY"]})
I would like to add a new column representing the time since the previous transaction on the same symbol, regardless of type. In the example above, this column would be
[NaN, 1, NaN, 4, 4]
The first entry for each symbol is always NaN because there is no earlier transaction on that symbol. The second entry is 1 because AAPL last traded at time 1, and 2 - 1 = 1 is the difference between the current time and that last trade time.
I'd also like another column indicating whether the previous transaction on the same symbol was a BUY or a SELL. In this example, that column would be
[NaN, BUY, NaN, SELL, BUY]
Can someone please help me figure out how to do this?
I think I need to do a merge_asof to match the time to the smallest time before it. But since I want it to be independently computed for each symbol, perhaps I also need a groupby? I’m struggling to get the correct logic. Thanks
CodePudding user response:
Assuming this input, with rows sorted by time (col2):
   col1  col2  col3
0  AAPL     1   BUY
1  AAPL     2  SELL
2  GOOG     3   BUY
3  AAPL     6   BUY
4  GOOG     7   BUY
You can use groupby.shift and groupby.diff:
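g = df.groupby('col1')  # group rows by symbol
df['col4'] = g['col3'].shift()  # type of the previous transaction on the same symbol
df['col5'] = g['col2'].diff()  # time elapsed since that previous transaction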
Output:
   col1  col2  col3  col4  col5
0  AAPL     1   BUY   NaN   NaN
1  AAPL     2  SELL   BUY   1.0
2  GOOG     3   BUY   NaN   NaN
3  AAPL     6   BUY  SELL   4.0
4  GOOG     7   BUY   BUY   4.0
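The approach above relies on the rows already being in time order. If that is not guaranteed, a minimal sketch is to sort by the time column first and then apply the same shift/diff. The shuffled sample data and the col1/col2/col3 names here are assumptions for illustration, reusing the example from the question:

```python
import pandas as pd

# Same transactions as the example, deliberately shuffled to mimic unsorted input.
df = pd.DataFrame({
    "col1": ["GOOG", "AAPL", "AAPL", "GOOG", "AAPL"],
    "col2": [7, 6, 1, 3, 2],
    "col3": ["BUY", "BUY", "BUY", "BUY", "SELL"],
})

# Sort by time; a stable sort keeps rows with equal times in their original order.
df = df.sort_values("col2", kind="stable").reset_index(drop=True)

g = df.groupby("col1")           # one group per symbol
df["col4"] = g["col3"].shift()   # previous transaction type within the symbol
df["col5"] = g["col2"].diff()    # time since the previous transaction within the symbol

print(df)
```

Because shift and diff work on row order within each group, no merge_asof is needed here; sorting once up front should be enough as long as each row only needs the immediately preceding trade on the same symbol.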