How to detect sign change of values of a column of a pandas dataframe using numpy or pandas?


I want to detect sign changes in my data using either pandas or numpy. Between every pair of consecutive TIMESTEP values, I want to count the ids whose y changes sign (e.g. between TIMESTEPs 2800 and 2900, id 313 changes sign because y becomes negative). I have tried the code below, counting negatives and then dropping duplicates, but that is neither efficient nor correct.

df_negatives0 = df0.query('y < 0')
df_nonduplicate0 = df_negatives0.drop_duplicates(subset=["id"])

My dataset:

TIMESTEP    id  mass    y
0       42  0.755047    0.489375
0       245 0.723805    0.479446
0       344 0.675664    0.463363
...
...
2800    313 0.795699    0.00492984
2800    425 0.68311     0.282356
2900    42  0.755047    0.424421
2900    245 0.723805    0.0378489
2900    344 0.675664    0.127917
2900    313 0.795699    -0.0149792
2900    425 0.68311     0.273884
...
...
up to TIMESTEP 10000000

My desired data:

TIMESTEP    id_count    mass
2900        1           0.795699
...
...
500000      2           0.85245   0.54852 (I want to sum the masses if the id count is more than one)
...
...
up to TIMESTEP 10000000

CodePudding user response:

There is a dedicated function in numpy, np.sign, which can be applied directly to a pandas Series:

import numpy as np

# this returns the elementwise sign of the column x
np.sign(df['x'])

To detect a sign change from one row to the next, combine it with the .diff method:

# a non-zero difference between the signs of two consecutive rows marks a sign change
np.sign(df['x']).diff()
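
Putting the two ideas together for the posted data gives the whole pipeline. This is only a sketch, assuming df holds the columns TIMESTEP, id, mass and y from the question (the name sign_change is my own):

import numpy as np

# order rows by time within each id so that diff compares consecutive TIMESTEPs
df = df.sort_values(["id", "TIMESTEP"])

# within each id, the sign jumps by 2 (+1 -> -1 or -1 -> +1) exactly when y flips sign
sign_change = np.sign(df["y"]).groupby(df["id"]).diff().abs().eq(2)

# count the flipped ids per TIMESTEP and sum their masses
result = (df[sign_change]
          .groupby("TIMESTEP", as_index=False)
          .agg(id_count=("id", "size"), mass=("mass", "sum")))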

CodePudding user response:

  • For each id, find the rows where y changes sign
  • Aggregate per TIMESTEP
def filter_sign_changed(df_same_id):
    # rows of a single id, ordered by time
    df_same_id = df_same_id.sort_values(by="TIMESTEP")
    # the product of consecutive y values is negative exactly when the sign flips
    sign_changed = df_same_id.y * df_same_id.y.shift(1) < 0
    return df_same_id[sign_changed]

# step 1: keep, for each id, only the rows where y changed sign
sign_changed_df = df.groupby("id").apply(filter_sign_changed)
# step 2: per TIMESTEP, count how many ids flipped and sum their masses
result = sign_changed_df.groupby("TIMESTEP", as_index=False).agg(
    id_count=("id", "size"), mass=("mass", "sum"))
result
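
A quick sanity check with just the rows shown in the question (assuming pandas is imported as pd and filter_sign_changed is defined as above) reproduces the desired row for TIMESTEP 2900, where id 313 is the only one that flips:

import pandas as pd

# only the rows listed in the question
df = pd.DataFrame({
    "TIMESTEP": [0, 0, 0, 2800, 2800, 2900, 2900, 2900, 2900, 2900],
    "id":       [42, 245, 344, 313, 425, 42, 245, 344, 313, 425],
    "mass":     [0.755047, 0.723805, 0.675664, 0.795699, 0.68311,
                 0.755047, 0.723805, 0.675664, 0.795699, 0.68311],
    "y":        [0.489375, 0.479446, 0.463363, 0.00492984, 0.282356,
                 0.424421, 0.0378489, 0.127917, -0.0149792, 0.273884],
})

sign_changed_df = df.groupby("id").apply(filter_sign_changed)
result = sign_changed_df.groupby("TIMESTEP", as_index=False).agg(
    id_count=("id", "size"), mass=("mass", "sum"))
print(result)
#    TIMESTEP  id_count      mass
# 0      2900         1  0.795699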