Iterating over NumPy array for timestep

I have an array with 5 columns of datetime values, with 100k values in each column.

The values are the datetime for a timeline going from 2015-01-15 00:30 to 2020-12-31 23:00 in 30 minute increments.

Basically, what I want to do is loop through the values in the array and check whether each value is exactly 30 minutes after the previous one. In an array with columns, the previous value is the one directly above the value being checked.

There are probably a few ways to do this, but I've included a pseudocode sample of how I'm thinking about it:


for row in _the_whole_array:
    for cell in row:
        if cell is exactly 30 minutes after the cell above it:
            continue iterating
        else:
            store that value
return the smallest timestep found, and the biggest timestep found

I have looked at for loops as well as nditer, but I'm getting errors when iterating over datetimes, and I'm also wondering how I can get the value of the cell above the current cell.

Any help hugely appreciated

CodePudding user response：

Try using a pandas DataFrame:

import numpy as np
import pandas as pd

# Create a DataFrame from the NumPy array (your_array is your 2-D datetime array)
df = pd.DataFrame(your_array)

# Iterate through the DataFrame one row at a time
for index, row in df.iterrows():
    # Do something with index and row
    print(type(index))
    print(index)
    print('~~~~~~')

    print(type(row))
    print(row)
    print('------')

CodePudding user response：

I wasn't sure of your axis, even though your pseudocode uses rows, but the overall idea is the same: if your dtype is datetime, you can do basic arithmetic on it.

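import numpy as np

x, y = _the_whole_array.shape

# Steps between neighbouring cells within each row
for row in range(x):
    diff = _the_whole_array[row, 1:] - _the_whole_array[row, :-1]
    print(np.amin(diff), np.amax(diff))

# Steps between each cell and the cell above it, within each column
for column in range(y):
    diff = _the_whole_array[1:, column] - _the_whole_array[:-1, column]
    print(np.amin(diff), np.amax(diff))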
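
If the values are stored as NumPy datetime64 rather than Python datetime objects, the same subtraction can probably be done for the whole array at once with np.diff, with no explicit loop. Below is a minimal sketch under that assumption. The sample timeline, the artificial gap, and the names timeline, step, bad_rows and bad_cols are all made up for illustration; an object array of datetime.datetime values should convert with something like arr.astype("datetime64[m]") first.

```python
import numpy as np

# Hypothetical sample data: two columns on a 30-minute timeline
start = np.datetime64("2015-01-15T00:30")
step = np.timedelta64(30, "m")
col = start + step * np.arange(6)            # 6 timestamps, 30 minutes apart
timeline = np.stack([col, col], axis=1)      # shape (6, 2)

# Introduce one 60-minute gap in the second column for the demo
timeline[3:, 1] = timeline[3:, 1] + step

# Each cell minus the cell above it (axis=0 walks down the rows)
steps = np.diff(timeline, axis=0)            # shape (5, 2), dtype timedelta64[m]

# Positions in `steps` where the step is not exactly 30 minutes.
# Index i in `steps` is the step from row i to row i + 1 of `timeline`.
bad_rows, bad_cols = np.nonzero(steps != step)

smallest = steps.min()
largest = steps.max()
print(smallest, largest)
print(bad_rows, bad_cols)
```

Here steps has one row fewer than timeline, because the first row has nothing above it to compare against.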

CodePudding user response：

As mentioned by @Heidiki, consider using pandas because it will be much easier to process large data.

To address your question, you can create a temporary variable to store the values of the previous row while looping. At every iteration, you calculate the differences and check them against the overall min and max, as in your pseudocode.

Here is an example with test data (a 3x5 matrix). Please check whether the output matches your requirement.

from datetime import datetime, timedelta
import pandas as pd
import numpy as np

# test data
arr = np.array([
    [datetime(2019, 1, 2, 13, 12), datetime(2019, 1, 2, 15, 19),
     datetime(2019, 1, 2, 15, 59), datetime(2019, 1, 2, 17, 23),
     datetime(2019, 1, 2, 15, 18)],
    [datetime(2019, 1, 2, 13, 34), datetime(2019, 1, 2, 15, 57),
     datetime(2019, 1, 2, 18, 53), datetime(2019, 1, 2, 17, 34),
     datetime(2019, 1, 2, 15, 29)],
    [datetime(2019, 1, 2, 13, 49), datetime(2019, 1, 2, 16, 35),
     datetime(2019, 1, 2, 21, 18), datetime(2019, 1, 2, 17, 59),
     datetime(2019, 1, 2, 15, 46)]
])


def timedelta_to_minutes(dt: timedelta) -> int:
    return (dt.days * 24 * 60) + (dt.seconds // 60)


def min_max_timestep(data: np.ndarray) -> tuple:
    prev_row = min_step = max_step = None
    df = pd.DataFrame(data)
    for idx, row in df.iterrows():
        if not idx:
            prev_row = row
            continue

        diff = row - prev_row
        min_diff, max_diff = min(diff), max(diff)
        if min_step is None or min_diff < min_step:
            min_step = min_diff

        if max_step is None or max_diff > max_step:
            max_step = max_diff

        prev_row = row

    # convert timedelta to minutes
    return timedelta_to_minutes(min_step), timedelta_to_minutes(max_step)


result = min_max_timestep(arr)
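
As a possible alternative to keeping prev_row by hand, pandas can compute "this row minus the row above" for every column with DataFrame.diff. The following is only a sketch on a small made-up frame (the values and the names expected and irregular_rows are hypothetical, not from the question), showing how the min and max step and the rows that break the 30-minute pattern might be found.

```python
from datetime import datetime

import pandas as pd

# Hypothetical test data: two columns, the last step in column 0 is 60 minutes
df = pd.DataFrame([
    [datetime(2019, 1, 2, 13, 0), datetime(2019, 1, 2, 15, 0)],
    [datetime(2019, 1, 2, 13, 30), datetime(2019, 1, 2, 15, 30)],
    [datetime(2019, 1, 2, 14, 30), datetime(2019, 1, 2, 16, 0)],
])

# Each row minus the row above it; the first row has no predecessor, so drop it
steps = df.diff().iloc[1:]

expected = pd.Timedelta(minutes=30)

# Rows where at least one column deviates from the expected 30-minute step
irregular_rows = steps[(steps != expected).any(axis=1)]

# Smallest and largest step over all columns
min_step = steps.min().min()
max_step = steps.max().max()

# Convert to minutes for reporting
print(min_step.total_seconds() / 60, max_step.total_seconds() / 60)
print(irregular_rows)
```

The index of irregular_rows points at the row whose value is not exactly 30 minutes after the one above it, which may be easier to inspect than only the min and max.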