I am using pandas in a python notebook to make some data analysis. I am trying to make a simple nested loop, but this is very bad performing.
The problem is that I have two tables made of two columns each, the first containing time stamps (hh:mm:ss) and the second containing some integer values.
The first table (big_table) contains 86400 rows, one for each possible timestamp in a day, and each integer value is initially set to 0. The second table (small_table) contains less rows, one for each timestamp in which an actual integer value is registered. The goal is to map the small_table integers to the big_table integers, in the rows where the timestamp is the same. I also want to write the last written integer when the small_table timestamp is not found in the big_table timestamps.
I am doing this trying to "force" a Java/C way of doing it, which iterates over each element accessing them as the [i][j] elements of a matrix.
Is there any better way of doing this using pandas/numpy?
Code:
rel_time_pointer = small_table.INTEGER.iloc[0]
for i in range(small_table.shape[0]):
for j in range(big_table.shape[0]):
if (small_table.time.iloc[i] == big_table.time.iloc[j]):
rel_time_pointer = small_table.INTEGER.iloc[i]
big_table.INTEGER.iloc[j] = rel_time_pointer
break
else:
big_table.INTEGER.iloc[j] = rel_time_pointer
example:
big_table:
time INTEGER
00:00:00 0
00:00:01 0
00:00:02 0
00:00:03 0
00:00:04 0
00:00:05 0
00:00:06 0
.
.
.
23:59:59 0
small_table:
time INTEGER
00:00:03 100
00:00:05 100
big_table_after_execution:
time INTEGER
00:00:00 0
00:00:01 0
00:00:02 0
00:00:03 100
00:00:04 100
00:00:05 200
00:00:06 200
Using the @gtomer merge command:
big_table = big_table.merge(small_table, on='time', how='left')
and adding .fillna(0) at the end of the command I get:
time INTEGER__x INTEGER__y
00:00:00 0 0.0
00:00:01 0 0.0
... ... ...
with the INTEGER values of small_table in the right places of big_table_after_execution. Now I'm trying to set the 0 values to the not-0 top element:
time INTEGER__x INTEGER__y
00:00:00 0 0.0
00:00:01 0 0.0
00:00:02 0 0.0
00:00:03 0 1.0
00:00:04 0 1.0
00:00:05 0 2.0
00:00:06 0 2.0
instead of:
00:00:00 0 0.0
00:00:01 0 0.0
00:00:02 0 0.0
00:00:03 0 1.0
00:00:04 0 0.0
00:00:05 0 2.0
00:00:06 0 0.0
CodePudding user response:
Please try the following:
big_table_after_execution = big_table.merge(small_table, on='time', how='left')
Please post the output you get and we'll continue from there
CodePudding user response:
Numpy iteration and enumeration options:
if you have a 2d np.ndarray type object, then iteration can be achieved in one line as follows:
for (i,j), value in np.ndenumerate(ndarray_object):...
This works like regular enumerate, but allows you to deconstruct the higher dimensional index into a tuple of appropriate dimensions.
You could maybe place your values into a 2d array structure from numpy and iterate through them like that?
The easiest way to modify what you already have so that it looks less 'c-like' is probably to just use regular enumerate:
for small_index, small_value in enumerate(small_table):
for big_index, big_value in enumerate(big_table):...
zip
Another option for grouping your iteration together is the zip()
function, which will combine iterable 1 and 2, but it will only produce a resultant iterable as with a length equal to the minimum iterable length.