Home > database >  Fix the data frame: for one date one value in Pandas
Fix the data frame: for one date one value in Pandas

Time:08-06

I just started learning the Pandas package in Python and I need help.

I have got this data frame:

    dataStream  rowCount    probingTimestamp
0   BCL_TaskCreation    315349655   2020-01-13 09:33:00.497
1   BCL_TaskCreation    315349655   2020-01-13 09:34:00.157
2   BCL_TaskCreation    314533770   2020-01-11 01:18:00.143
3   BCL_TaskCreation    289249872   2020-04-05 16:10:00.333
4   BCL_TaskCreation    289249872   2020-04-05 16:11:00.883
... ... ... ...
160730  BCL_TaskCreation    289702842   2020-05-02 04:43:00.767
160731  BCL_TaskCreation    289702943   2020-05-02 07:26:01.040
160732  BCL_TaskCreation    289702883   2020-05-02 06:10:00.157
160733  BCL_TaskCreation    289702842   2020-05-02 04:18:00.117
160734  BCL_TaskCreation    289702883   2020-05-02 06:11:00.597

As you can see, values in the 0 and 1 rows (date: 2020-01-13), 3 and 4 rows (date: 2020-04-05) are the same, because the date is the same, but from 160730 to 160734 row values in the column rowCount are different. And I want to make values of one day the same. For each day, this value will be set as the first encountered value for that day. That is, if we meet a new day 2020-01-10 10:04:18.837b and it has a value 314015548 from rowCount, then for the next hours of the same day the value will be the same: 314015548.

So what I want to get:

    dataStream  rowCount    probingTimestamp
0   BCL_TaskCreation    315349655   2020-01-13 09:33:00.497
1   BCL_TaskCreation    315349655   2020-01-13 09:34:00.157
2   BCL_TaskCreation    314533770   2020-01-11 01:18:00.143
3   BCL_TaskCreation    289249872   2020-04-05 16:10:00.333
4   BCL_TaskCreation    289249872   2020-04-05 16:11:00.883
... ... ... ...
160730  BCL_TaskCreation    289702842   2020-05-02 04:43:00.767
160731  BCL_TaskCreation    289702842   2020-05-02 07:26:01.040
160732  BCL_TaskCreation    289702842   2020-05-02 06:10:00.157
160733  BCL_TaskCreation    289702842   2020-05-02 04:18:00.117
160734  BCL_TaskCreation    289702842   2020-05-02 06:11:00.597

How can I get it in Pandas library in Python? Or I can get it only using the Python loops? If I can get only using the Python loops, how can I get it?

Thanks.

CodePudding user response:

try:

# only if the column probingTimestamp is not of type datetime64 else skip this line
df.probingTimestamp = df.probingTimestamp.astype('datetime64')

df['rowCount'] = df.probingTimestamp.dt.date\       # Get date without time
    .map(                                           # Map the first value for each date as follow
        df\
            .sort_values(by='probingTimestamp')     # Sort df by timestamp
            .groupby(df.probingTimestamp.dt.date)\  # Group by date
            .first().rowCount                       # Get first rowCount for each date
    )

Output for the test data provided:

    dataStream          rowCount    probingTimestamp
0   BCL_TaskCreation    315349655   2020-01-13 09:33:00.497
1   BCL_TaskCreation    315349655   2020-01-13 09:34:00.157
2   BCL_TaskCreation    314533770   2020-01-11 01:18:00.143
3   BCL_TaskCreation    289249872   2020-04-05 16:10:00.333
4   BCL_TaskCreation    289249872   2020-04-05 16:11:00.883
5   BCL_TaskCreation    289702842   2020-05-02 04:43:00.767
6   BCL_TaskCreation    289702842   2020-05-02 07:26:01.040
7   BCL_TaskCreation    289702842   2020-05-02 06:10:00.157
8   BCL_TaskCreation    289702842   2020-05-02 04:18:00.117
9   BCL_TaskCreation    289702842   2020-05-02 06:11:00.597
  • Related