Is there a faster way to ignore dates and get specific times in a datetime series than a for loop?


Using Python 3.9 on macOS with PyCharm. I have a .csv with two columns: a date/time string and a float:

                       time  Value
0       2021-12-01 20:00:00    166
1       2021-12-01 19:59:00    165

The time granularity is one minute on most days (some days are skipped entirely). What I would like to do is take a single value for each listed day at a specific time, for instance 5 pm on every day contained in the data. That way, instead of having hundreds of values per day, I would have:

                       time  Value
0       2021-12-02 17:00:00    166
1       2021-12-01 17:00:00    165
etc

The best I have come up with from reading about string/timeseries splitting is a for loop in which each timestamp is split and rows whose time component is 17:00:00 are collected into a new DataFrame:

import pandas as pd

df = pd.read_csv('name')

# split each 'YYYY-MM-DD HH:MM:SS' string into [date, time]
split_times = df['time'].str.split()
temp_df = pd.DataFrame()
for i in range(len(df)):
    if split_times[i][1] == '17:00:00':
        temp_df = temp_df.append(df.loc[[i]], ignore_index=True)
temp_df['time'] = pd.to_datetime(temp_df['time'])  # make datetime for further computing
print(temp_df)

output:
...
                           time  Value
    175     2021-12-01 17:00:00    164
    1083    2021-11-30 17:00:00    165
    3313    2021-11-24 17:00:00    161

The other way would be to split it using df['Dates'] = pd.to_datetime(df['time']).dt.date, but it doesn't seem any less demanding than my current loop.

I've found sources on sorting and selecting dates/ranges (including the pandas documentation), but none that show how to keep the date yet ignore it while selecting for a specific time of day.

CodePudding user response:

Try this:

df[df.time.str.contains('17:00:00')]
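A minimal runnable sketch of this one-liner, with a small sample frame standing in for the question's CSV (the data here is assumed for illustration):

```python
import pandas as pd

# Sample frame mimicking the question: 'time' is still a plain string column
df = pd.DataFrame({
    'time': ['2021-12-01 17:00:00', '2021-12-01 17:01:00', '2021-12-02 17:00:00'],
    'Value': [164, 165, 166],
})

# Boolean mask: keep rows whose time string contains the target clock time
temp_df = df[df.time.str.contains('17:00:00')]
print(temp_df)
```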

CodePudding user response:

"maintain yet ignore the date while selecting for time"

After converting the times with to_datetime:

df.time = pd.to_datetime(df.time)

1. Either use Series.dt.time to compare against a datetime.time:

    from datetime import time
    temp_df = df[df.time.dt.time == time(17, 0)]

2. Or use Series.dt.strftime to compare just the %H:%M:%S portion:

    temp_df = df[df.time.dt.strftime('%H:%M:%S').eq('17:00:00')]

This is all done without a loop, using vectorized comparisons and boolean indexing.

You can also check against strings (as in sayv's answer), but storing time data as actual datetime objects will make analysis easier in the long run.
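Both variants above can be tried end to end on a small sample frame (the data here is assumed; column names follow the question):

```python
from datetime import time

import pandas as pd

# Sample data standing in for the question's CSV
df = pd.DataFrame({
    'time': ['2021-12-01 17:00:00', '2021-12-01 18:00:00', '2021-12-02 17:00:00'],
    'Value': [164, 150, 166],
})
df.time = pd.to_datetime(df.time)

# 1. Compare the time-of-day component to a datetime.time object
by_time = df[df.time.dt.time == time(17, 0)]

# 2. Compare the formatted %H:%M:%S string
by_str = df[df.time.dt.strftime('%H:%M:%S').eq('17:00:00')]

print(by_time)
```

Both masks select the same rows; the dt.time comparison avoids formatting a string per row.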

CodePudding user response:

If you want to discard rows whose time is not 17:00:00 anyway, I think the most efficient solution is to not parse to datetime on import, i.e. leave the date/time as a string. Then select the rows at 5 pm and parse only those to datetime.

A small timeit comparison shows that string comparison is about as fast as comparing the time part of a datetime-dtype column to a time object; however, the string route saves computation overall because far fewer strings get parsed to datetime in the first place.
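A sketch of that filter-first strategy (sample data assumed; only the matching rows are parsed):

```python
import pandas as pd

# Sample frame standing in for the unparsed CSV: 'time' left as a string
df = pd.DataFrame({
    'time': ['2021-12-01 17:00:00', '2021-12-01 19:59:00', '2021-12-02 17:00:00'],
    'Value': [164, 165, 166],
})

# Select the 5 pm rows while 'time' is still a string...
temp_df = df[df['time'].str.contains('17:00:00')].copy()

# ...then parse only that much smaller subset to datetime
temp_df['time'] = pd.to_datetime(temp_df['time'])
print(temp_df)
```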

from datetime import time
import pandas as pd

# make hourly data, ~100k elements
s_dt = pd.date_range('2010-01-01', '2020-12-31', freq='H').to_series()
s_str = s_dt.astype(str)

%timeit m = s_str.str.contains('17:00:00')
38.5 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit m = s_str.str.split(' ').str[1] == '17:00:00'
94 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit m = s_dt.dt.time == time(17, 0)
41.9 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)