I have a dataframe like these:
symbol Id Volume cumVolume ...
00001 93050000 100 100 ...
00001 93156340 100 200 ...
00001 94040000 -200 0 ...
00001 94041040 -100 -100 ...
... ... ... ...
00002 93050000 -100 -100 ...
00002 93156340 -100 -200 ...
00002 94040000 100 -100 ...
00002 94041040 400 300 ...
Ideally, the cumVolume
of a symbol changed the sign by reaching zero first, such as 00001
(from 200 to 0 then to -100). However, for some symbols like 00002
, the cumVolume
changed the sign without reaching zero but cross it (from -100 to 300).
I would like to split these rows and get a dataframe like this:
symbol Id Volume cumVolume ...
00001 93050000 100 100 ...
00001 93156340 100 200 ...
00001 94040000 -200 0 ...
00001 94041040 -100 -100 ...
... ... ... ...
00002 93050000 -100 -100 ...
00002 93156340 -100 -200 ...
00002 94040000 100 -100 ...
00002 94041040 100 0 ...
00002 94041040 300 300 ...
Notice that I split the last row into two rows by dividing the original Volume
400 into 100 and 300, so that my cumVolume
can display zero now. Other columns' information should stay the same.
I am a little struggling with this. I tried to insert rows based on indexes but my dataset is very large and there are many columns and symbols. It is hard to get the indexes of rows that should be split and to change two columns' values. It would be great if you could give me some hints or solutions.
CodePudding user response:
Getting which index to split, you can do it as below.
def should_split_row(current_cumVolume, previous_cumVolume):
if current_cumVolume < 0 and previous_cumVolume > 0:
return True
elif current_cumVolume > 0 and previous_cumVolume < 0:
return True
previous_symbol = df['symbol'].iloc[0]
for index, row in df.iterrows():
current_symbol = row['symbol']
if current_symbol == previous_symbol and index !=0 and should_split_row(row['cumVolume'], df['cumVolume'].iloc[index-1]):
print('This row should be split: index ->' , index)
previous_symbol = current_symbol
Further, to insert new rows, you can refer to this question Is it possible to insert a row at an arbitrary position in a dataframe using pandas?
CodePudding user response:
"It is hard to get the indexes of rows that should be split and to change two columns' values"
It is actually fairly easy, you need to track the sign and using the difference between consecutive values you can identify the change.
import numpy as np
s = np.sign(df['cumVolume'])
Output:
0 1
1 1
2 0
3 -1
4 -1
5 -1
6 -1
7 1
Name: cumVolume, dtype: int64
Now check the difference between successive rows per group. If 2
, we switched from negative to positive without stopping through zero. If -2
, positive to negative.
mask = s.groupby(df['symbol']).diff().abs().eq(2)
idx = s[mask].index
# Int64Index([7], dtype='int64')
Now you have a boolean mask and the list of indices where a hidden switch happens.
To generate the final output, simply concat
the modified slice and the original dataframe:
df2 = (pd.concat([df[mask].assign(cumVolume=0),
df])
.sort_index()
)
Output:
symbol Id Volume cumVolume
0 1 93050000 100 100
1 1 93156340 100 200
2 1 94040000 -200 0
3 1 94041040 -100 -100
4 2 93050000 -100 -100
5 2 93156340 -100 -200
6 2 94040000 100 -100
7 2 94041040 400 0
7 2 94041040 400 300