I am new to data science and am trying to understand some basic pandas examples.
I have a pandas data frame to which I would like to add a new column holding conditional values: "yes" at every 2 seconds, and "no" otherwise. Here is an example:
This is my original data frame.
   id   name     time
0   1  name1  260.123
1   2  name2  261.323
2   3  name3  261.342
3   4  name4  261.567
4   5  name5  262.123
...
The new data frame will be like this:
   id    name     time  time_delta
0   1   name1  260.123         yes
1   2   name2  261.323          no
2   3   name3  261.342          no
3   4   name4  261.567          no
4   5   name5  262.123         yes
5   6   name6  262.345         yes
6   7   name7  264.876         yes
7   8   name8  265.234          no
8   9   name9  266.234         yes
9  10  name10  267.234          no
...
The code that I was using is:
df['time_delta'] = df['time'].apply(apply_test)
And the actual code of the function:
def apply_test(num):
    prev = num
    if round(num) != prev + 2:
        prev = prev
        return "no"
    else:
        prev = num
        return "yes"
Please note that the time column has decimals and no patterns.
The result came out as all "no", since prev is reassigned to the current number on every call. This was the only approach I could think of, and I am not sure whether there are better ways. I would appreciate any help.
UPDATE:
- Please note that the time column has decimals, and the fractional part should be ignored in this case. For instance, time=234.xxx will be considered as 234 seconds. Therefore, the next 2-second point is 236.
- After rounding down, several rows can share the same second value. In that case, all of them have to be marked as "yes". Please refer to the updated result data frame above as an example.
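One way to read the two rules above, as a minimal sketch rather than a definitive solution: floor each time to whole seconds and mark a row "yes" when its floored second lies a multiple of 2 away from the first row's floored second. The names N, secs and start are illustrative, and anchoring at the first row is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# sample data copied from the expected result shown in the question
df = pd.DataFrame({
    "id": range(1, 11),
    "name": [f"name{i}" for i in range(1, 11)],
    "time": [260.123, 261.323, 261.342, 261.567, 262.123,
             262.345, 264.876, 265.234, 266.234, 267.234],
})

N = 2  # step in seconds (illustrative name)

# drop the fractional part: 234.xxx counts as 234 seconds
secs = np.floor(df["time"]).astype(int)

# assumption: the 2-second grid starts at the first row's floored second
start = secs.iloc[0]

# every row whose floored second falls on the grid gets "yes"
df["time_delta"] = np.where((secs - start) % N == 0, "yes", "no")
print(df)
```

With this reading, rows 4 and 5 (both 262 after flooring) are both marked "yes", which is what the second bullet asks for.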
CodePudding user response:
You can use:
import numpy as np
import pandas as pd

N = 2  # time step in seconds

# define bin edges every N seconds, starting at the floored minimum
bins = np.arange(np.floor(df['time'].min()), df['time'].max() + N, N)

# get the index of the first row in each bin (observed=True skips empty bins)
idx = df.groupby(pd.cut(df['time'], bins), observed=True)['time'].idxmin()

# assign "yes" to the first row of each bin, "no" otherwise
df['time_delta'] = np.where(df.index.isin(idx), 'yes', 'no')
Output:
   id   name     time  time_delta
0   1  name1  260.123         yes
1   2  name2  260.323          no
2   3  name3  261.342          no
3   4  name4  261.567          no
4   5  name5  262.123         yes
5   6  name6  263.345          no
6   7  name7  264.876         yes
CodePudding user response:
You can floor-divide the cumulative sum of the time differences by 2 and check where the result changes value. Each change marks the point where the time enters a new segment of length 2:
import numpy as np

remaining = (df['time'].diff().cumsum() // 2).fillna(0)
df['time_delta'] = np.where(~remaining.duplicated(), 'yes', 'no')
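Since diff().cumsum() simply rebuilds each row's offset from the first timestamp, the same segment index can be sketched more directly. The sample values below are made up for illustration, N and segment are hypothetical names, and the times are assumed to be sorted in ascending order.

```python
import numpy as np
import pandas as pd

# made-up sample times, assumed sorted ascending
df = pd.DataFrame({"time": [260.5, 261.0, 261.9, 262.6, 263.0, 264.7, 266.9]})

N = 2  # segment length in seconds (illustrative name)

# offset of each row from the first timestamp, floor-divided into segments
segment = (df["time"] - df["time"].iloc[0]) // N

# the first row seen in each segment gets "yes", the rest "no"
df["time_delta"] = np.where(~segment.duplicated(), "yes", "no")
print(df)
```

Note that the segments here are anchored at the first timestamp rather than at whole seconds, and only the first row of each segment is marked, so the result can differ from the floor-based rule described in the question's UPDATE.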