I want to detect sign changes in my data using either pandas or numpy. Specifically, I want to count the number of ids whose y value changes sign between two consecutive TIMESTEP values (e.g. between TIMESTEPs 2800 and 2900, id 313 changes sign because y becomes negative). I have tried the code below, counting negatives and then dropping duplicates, but that is neither efficient nor correct.
df_negatives0 = df0.query('y < 0')
df_nonduplicate0 = df_negatives0.drop_duplicates(subset=["id"])
My dataset:
TIMESTEP  id    mass      y
0         42    0.755047   0.489375
0         245   0.723805   0.479446
0         344   0.675664   0.463363
...
2800      313   0.795699   0.00492984
2800      425   0.68311    0.282356
2900      42    0.755047   0.424421
2900      245   0.723805   0.0378489
2900      344   0.675664   0.127917
2900      313   0.795699  -0.0149792
2900      425   0.68311    0.273884
...
up to 10000000
My desired data:
TIMESTEP  id_count  mass
2900      1         0.795699
...
500000    2         0.85245 0.54852   (I want to sum the masses if the id count is more than one)
...
up to 10000000
CodePudding user response:
There is a dedicated function in numpy, np.sign, which can be applied directly to a pandas Series (numpy ufuncs accept a Series and return one):
import numpy as np

# returns the sign (-1.0, 0.0 or 1.0) of each float in the column
np.sign(df['x'])
To detect a sign change from one row to the next, chain it with the .diff method:
# non-zero (and non-NaN) where the sign differs between two consecutive rows
np.sign(df['x']).diff()
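Putting the two together for the layout in the question, here is a minimal sketch, assuming the frame is named df and has the columns TIMESTEP, id, mass and y shown above:
import numpy as np
import pandas as pd

# sort so that consecutive rows of the same id correspond to consecutive TIMESTEPs
df = df.sort_values(["id", "TIMESTEP"])

# sign of y per row, then how that sign changes within each id
sign_diff = df.groupby("id")["y"].transform(lambda s: np.sign(s).diff())

# keep only the rows where the sign actually changed (NaN marks the first row of an id)
changed = df[sign_diff.fillna(0) != 0]

# count the changed ids and sum their masses per TIMESTEP
result = changed.groupby("TIMESTEP", as_index=False).agg(
    id_count=("id", "size"), mass=("mass", "sum"))
One caveat: a y value of exactly zero also counts as a sign change with this test, since np.sign returns 0 there.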
CodePudding user response:
- For each id, find the rows where the sign of y changed
- Aggregate those rows for each TIMESTEP
import pandas as pd

def filter_sign_changed(df_same_id):
    # sort so shift(1) compares consecutive TIMESTEPs of the same id
    df_same_id = df_same_id.sort_values(by="TIMESTEP")
    # a negative product of consecutive y values means the sign flipped
    sign_changed = df_same_id.y * df_same_id.y.shift(1) < 0
    return df_same_id[sign_changed]

# group_keys=False keeps a flat index for the per-TIMESTEP aggregation below
sign_changed_df = df.groupby("id", group_keys=False).apply(filter_sign_changed)
result = sign_changed_df.groupby("TIMESTEP", as_index=False).agg(
    id_count=("id", "size"), mass=("mass", "sum"))
result
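As a quick sanity check, here is a hypothetical toy frame built only from the sample rows around TIMESTEPs 2800 and 2900 copied from the question:
import pandas as pd

# toy subset of the question's data, just enough to show the one sign flip (id 313)
df = pd.DataFrame({
    "TIMESTEP": [2800, 2800, 2900, 2900, 2900, 2900, 2900],
    "id":       [313, 425, 42, 245, 344, 313, 425],
    "mass":     [0.795699, 0.68311, 0.755047, 0.723805, 0.675664, 0.795699, 0.68311],
    "y":        [0.00492984, 0.282356, 0.424421, 0.0378489, 0.127917, -0.0149792, 0.273884],
})
Running the snippet above on this subset should produce a single row (TIMESTEP 2900, id_count 1, mass 0.795699), matching the desired output in the question.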