I just started learning the Pandas package in Python and I need help.
I have got this data frame:
dataStream rowCount probingTimestamp
0 BCL_TaskCreation 315349655 2020-01-13 09:33:00.497
1 BCL_TaskCreation 315349655 2020-01-13 09:34:00.157
2 BCL_TaskCreation 314533770 2020-01-11 01:18:00.143
3 BCL_TaskCreation 289249872 2020-04-05 16:10:00.333
4 BCL_TaskCreation 289249872 2020-04-05 16:11:00.883
... ... ... ...
160730 BCL_TaskCreation 289702842 2020-05-02 04:43:00.767
160731 BCL_TaskCreation 289702943 2020-05-02 07:26:01.040
160732 BCL_TaskCreation 289702883 2020-05-02 06:10:00.157
160733 BCL_TaskCreation 289702842 2020-05-02 04:18:00.117
160734 BCL_TaskCreation 289702883 2020-05-02 06:11:00.597
As you can see, values in the 0 and 1 rows (date: 2020-01-13), 3 and 4 rows (date: 2020-04-05) are the same, because the date is the same, but from 160730 to 160734 row values in the column rowCount are different. And I want to make values of one day the same. For each day, this value will be set as the first encountered value for that day. That is, if we meet a new day 2020-01-10 10:04:18.837b and it has a value 314015548 from rowCount, then for the next hours of the same day the value will be the same: 314015548.
So what I want to get:
dataStream rowCount probingTimestamp
0 BCL_TaskCreation 315349655 2020-01-13 09:33:00.497
1 BCL_TaskCreation 315349655 2020-01-13 09:34:00.157
2 BCL_TaskCreation 314533770 2020-01-11 01:18:00.143
3 BCL_TaskCreation 289249872 2020-04-05 16:10:00.333
4 BCL_TaskCreation 289249872 2020-04-05 16:11:00.883
... ... ... ...
160730 BCL_TaskCreation 289702842 2020-05-02 04:43:00.767
160731 BCL_TaskCreation 289702842 2020-05-02 07:26:01.040
160732 BCL_TaskCreation 289702842 2020-05-02 06:10:00.157
160733 BCL_TaskCreation 289702842 2020-05-02 04:18:00.117
160734 BCL_TaskCreation 289702842 2020-05-02 06:11:00.597
How can I get it in Pandas library in Python? Or I can get it only using the Python loops? If I can get only using the Python loops, how can I get it?
Thanks.
CodePudding user response:
try:
# only if the column probingTimestamp is not of type datetime64 else skip this line
df.probingTimestamp = df.probingTimestamp.astype('datetime64')
df['rowCount'] = df.probingTimestamp.dt.date\ # Get date without time
.map( # Map the first value for each date as follow
df\
.sort_values(by='probingTimestamp') # Sort df by timestamp
.groupby(df.probingTimestamp.dt.date)\ # Group by date
.first().rowCount # Get first rowCount for each date
)
Output for the test data provided:
dataStream rowCount probingTimestamp
0 BCL_TaskCreation 315349655 2020-01-13 09:33:00.497
1 BCL_TaskCreation 315349655 2020-01-13 09:34:00.157
2 BCL_TaskCreation 314533770 2020-01-11 01:18:00.143
3 BCL_TaskCreation 289249872 2020-04-05 16:10:00.333
4 BCL_TaskCreation 289249872 2020-04-05 16:11:00.883
5 BCL_TaskCreation 289702842 2020-05-02 04:43:00.767
6 BCL_TaskCreation 289702842 2020-05-02 07:26:01.040
7 BCL_TaskCreation 289702842 2020-05-02 06:10:00.157
8 BCL_TaskCreation 289702842 2020-05-02 04:18:00.117
9 BCL_TaskCreation 289702842 2020-05-02 06:11:00.597