Home > Software design >  Improve nested loop with pandas
Improve nested loop with pandas

Time:03-04

I am using pandas in a python notebook to make some data analysis. I am trying to make a simple nested loop, but this is very bad performing.

The problem is that I have two tables made of two columns each, the first containing time stamps (hh:mm:ss) and the second containing some integer values.

The first table (big_table) contains 86400 rows, one for each possible timestamp in a day, and each integer value is initially set to 0. The second table (small_table) contains less rows, one for each timestamp in which an actual integer value is registered. The goal is to map the small_table integers to the big_table integers, in the rows where the timestamp is the same. I also want to write the last written integer when the small_table timestamp is not found in the big_table timestamps.

I am doing this trying to "force" a Java/C way of doing it, which iterates over each element accessing them as the [i][j] elements of a matrix.

Is there any better way of doing this using pandas/numpy?

Code:

rel_time_pointer = small_table.INTEGER.iloc[0]

for i in range(small_table.shape[0]):

  for j in range(big_table.shape[0]):

    if (small_table.time.iloc[i] == big_table.time.iloc[j]):
      rel_time_pointer = small_table.INTEGER.iloc[i]
      big_table.INTEGER.iloc[j] = rel_time_pointer
      break

    else:
      big_table.INTEGER.iloc[j] = rel_time_pointer

example:

big_table:
time        INTEGER
00:00:00    0
00:00:01    0
00:00:02    0
00:00:03    0
00:00:04    0
00:00:05    0
00:00:06    0
    .
    .
    .
23:59:59    0

small_table:
time        INTEGER
00:00:03    100
00:00:05    100

big_table_after_execution:
time        INTEGER
00:00:00    0
00:00:01    0
00:00:02    0
00:00:03    100
00:00:04    100
00:00:05    200
00:00:06    200

Using the @gtomer merge command:

    big_table = big_table.merge(small_table, on='time', how='left')

and adding .fillna(0) at the end of the command I get:

    time    INTEGER__x  INTEGER__y
    00:00:00    0   0.0
    00:00:01    0   0.0
    ...         ... ...

with the INTEGER values of small_table in the right places of big_table_after_execution. Now I'm trying to set the 0 values to the not-0 top element:

    time    INTEGER__x  INTEGER__y
    00:00:00    0   0.0
    00:00:01    0   0.0
    00:00:02    0   0.0
    00:00:03    0   1.0
    00:00:04    0   1.0
    00:00:05    0   2.0
    00:00:06    0   2.0

instead of:

    00:00:00    0   0.0
    00:00:01    0   0.0
    00:00:02    0   0.0
    00:00:03    0   1.0
    00:00:04    0   0.0
    00:00:05    0   2.0
    00:00:06    0   0.0

CodePudding user response:

Please try the following:

big_table_after_execution = big_table.merge(small_table, on='time', how='left')

Please post the output you get and we'll continue from there

CodePudding user response:

Numpy iteration and enumeration options:

if you have a 2d np.ndarray type object, then iteration can be achieved in one line as follows:

for (i,j), value in np.ndenumerate(ndarray_object):... 

This works like regular enumerate, but allows you to deconstruct the higher dimensional index into a tuple of appropriate dimensions.

You could maybe place your values into a 2d array structure from numpy and iterate through them like that?

The easiest way to modify what you already have so that it looks less 'c-like' is probably to just use regular enumerate:

for small_index, small_value in enumerate(small_table):

  for big_index, big_value in enumerate(big_table):...

zip

Another option for grouping your iteration together is the zip() function, which will combine iterable 1 and 2, but it will only produce a resultant iterable as with a length equal to the minimum iterable length.

  • Related