I am getting face landmarks for each frame in a video. There are 477 landmarks, and each one is a (3,) vector.
I have a 10 minute video at 30 fps, which means I have 18000 arrays of shape (477, 3). I want to store all this info in a pandas DataFrame where each row is a frame and has 477 columns, one for each (3,) array.
Currently, I am doing this:
import numpy as np
import pandas as pd

frame_lms = []
for frame in video:
    landmark_dict = {}
    lm_count = 0
    for landmark in frame:
        x = landmark.x
        y = landmark.y
        xy = np.array([x, y])
        landmark_dict[f"lm_{lm_count}"] = xy
        lm_count += 1
    frame_lms.append(landmark_dict)
df = pd.DataFrame.from_dict(frame_lms)
df.to_csv('save.csv')
I got the idea to store everything in a list of dicts and build the DataFrame at the end from research showing that from_dict is the fastest way to create a pandas DataFrame. However, this process is still slow because I have to hold frame_lms in memory, and it gets huge as I append (477, 3) arrays into it.
What is the most computationally efficient way to solve a problem like this?
CodePudding user response:
It is better to avoid creating many numpy.array objects in the inner portion of a nested loop. Your code is much faster if you change xy = np.array([x, y]) in the inner loop to xy = (x, y). In the following code I left the conversion to numpy.ndarray out, since I understand that is OK for the OP.
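If you want to quantify the difference on your own machine, here is a minimal timeit sketch (the iteration counts are arbitrary) comparing tuple creation against np.array creation for 477 two-element pairs per pass:

import timeit

# Build 477 (x, y) pairs per pass, as in one frame of the OP's inner loop.
tuple_time = timeit.timeit("[(i, i) for i in range(477)]", number=10_000)
array_time = timeit.timeit(
    "[np.array([i, i]) for i in range(477)]",
    setup="import numpy as np",
    number=10_000,
)
print(f"tuples:   {tuple_time:.2f} s")
print(f"np.array: {array_time:.2f} s")

The tuple version is typically an order of magnitude faster, because np.array has a fixed per-call overhead that dominates for two-element inputs.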
Since Python is very efficient at managing lists, you can create a list of lists with the data and assign the column names when creating the DataFrame.
The faster, Pythonic way of creating the list is:
rv = [[(lm.x, lm.y) for lm in f] for f in video]
It is equivalent to the following, slightly slower code (not recommended):
import numpy as np
# load video here
rv = []
for frame in video:
    internal = []
    for landmark in frame:
        internal.append((landmark.x, landmark.y))
    rv.append(internal)
You can create the DataFrame from the lists using
df = pd.DataFrame(rv, columns=[f"lm_{count}" for count in range(477)])
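Putting it all together, here is a sketch of the complete pipeline, keeping all three coordinates since the question describes (3,) vectors per landmark (the .z attribute and the video iterable are assumptions carried over from the question; adjust them to your landmark library):

import pandas as pd

# List of lists of plain tuples: one inner list per frame,
# one 3-tuple per landmark. The .x/.y/.z attribute names are assumed.
rv = [[(lm.x, lm.y, lm.z) for lm in f] for f in video]

# One row per frame, one column per landmark; each cell holds a 3-tuple.
df = pd.DataFrame(rv, columns=[f"lm_{count}" for count in range(477)])
df.to_csv('save.csv')

Note that to_csv serializes each tuple cell as its string representation, so reading the values back as floats requires parsing. If you only need the raw numbers, saving a single NumPy array of shape (18000, 477, 3) with np.save is a leaner alternative.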