split rows where one column's value changed the sign but crossed zero

I have a dataframe like these:

symbol      Id      Volume    cumVolume   ...                                                 
00001   93050000    100         100     ...
00001   93156340    100         200     ...    
00001   94040000   -200           0     ...    
00001   94041040   -100        -100     ...    
...       ...       ...         ...                      
00002   93050000   -100        -100     ...
00002   93156340   -100        -200     ...    
00002   94040000    100        -100     ...    
00002   94041040    400         300     ...

Ideally, the cumVolume of a symbol changed the sign by reaching zero first, such as 00001 (from 200 to 0 then to -100). However, for some symbols like 00002, the cumVolume changed the sign without reaching zero but cross it (from -100 to 300).

I would like to split these rows and get a dataframe like this:

symbol      Id      Volume    cumVolume   ...                                                 
00001   93050000    100         100     ...
00001   93156340    100         200     ...    
00001   94040000   -200           0     ...    
00001   94041040   -100        -100     ...    
...       ...       ...         ...                      
00002   93050000   -100        -100     ...
00002   93156340   -100        -200     ...    
00002   94040000    100        -100     ...    
00002   94041040    100           0     ...   
00002   94041040    300         300     ...

Notice that I split the last row into two rows by dividing the original Volume 400 into 100 and 300, so that my cumVolume can display zero now. Other columns' information should stay the same.

I am a little struggling with this. I tried to insert rows based on indexes but my dataset is very large and there are many columns and symbols. It is hard to get the indexes of rows that should be split and to change two columns' values. It would be great if you could give me some hints or solutions.

CodePudding user response：

Getting which index to split, you can do it as below.

def should_split_row(current_cumVolume, previous_cumVolume):
    if current_cumVolume < 0 and previous_cumVolume > 0:
        return True
    elif current_cumVolume > 0 and previous_cumVolume < 0:
        return True


previous_symbol = df['symbol'].iloc[0]

for index, row in df.iterrows():
    current_symbol = row['symbol']
    if current_symbol == previous_symbol and index !=0 and should_split_row(row['cumVolume'], df['cumVolume'].iloc[index-1]):        
        print('This row should be split: index ->' , index)

    previous_symbol = current_symbol

Further, to insert new rows, you can refer to this question Is it possible to insert a row at an arbitrary position in a dataframe using pandas?

CodePudding user response：

"It is hard to get the indexes of rows that should be split and to change two columns' values"

It is actually fairly easy, you need to track the sign and using the difference between consecutive values you can identify the change.

import numpy as np
s = np.sign(df['cumVolume'])

Output:

0    1
1    1
2    0
3   -1
4   -1
5   -1
6   -1
7    1
Name: cumVolume, dtype: int64

Now check the difference between successive rows per group. If 2, we switched from negative to positive without stopping through zero. If -2, positive to negative.

mask = s.groupby(df['symbol']).diff().abs().eq(2)

idx = s[mask].index
# Int64Index([7], dtype='int64')

Now you have a boolean mask and the list of indices where a hidden switch happens.

To generate the final output, simply concat the modified slice and the original dataframe:

df2 = (pd.concat([df[mask].assign(cumVolume=0),
                  df])
         .sort_index()
         )

Output:

   symbol        Id  Volume  cumVolume
0       1  93050000     100        100
1       1  93156340     100        200
2       1  94040000    -200          0
3       1  94041040    -100       -100
4       2  93050000    -100       -100
5       2  93156340    -100       -200
6       2  94040000     100       -100
7       2  94041040     400          0
7       2  94041040     400        300