Efficient procedure to fill in a dataframe (recoding code with lists to use ndarrays)


I am trying to calculate some function results using predefined parameters that were generated beforehand and stored in lists.

But I need to rework this solution to save all results to a dataframe with an appropriate structure, where each row contains the following fields:

E | number_of_iteration | result

The actual mapping is like this:

E -> newenergy_grid[i]
number_of_iteration -> j
result -> gaus_res_ij

Here is the code that currently runs and collects all the data into a "list of lists" object. I have tried to collect all the data into a dataframe instead (the code commented out after the list generation) - but it works very slowly.

There are about 10000 points in the energy list and about 10000 in the offset list.

I understand that I need to optimize the solution to collect and store all the data in a single dataframe. How do I do it right?

Maybe I can use numpy for efficient calculation and afterwards just save the resulting ndarray into one dataframe? I understand the idea, but I don't really get how to realize it.

# TODO: IMPORTANT(!) it's much easier to work with dataframes...
# here I want to have such structure of resulting dataframe
# E | number_of_iteration | result
# results_df = pd.DataFrame()

all_res = [] #global list of lists - each list for one energy level

for i in range(0, len(newenergy_grid)):

    # res_for_one_E = np.empty(len(energy_grid)) 
    #iterating through the energy grid
    list_for_one_E = []

    for j in range(0, len(doffset)):
        #iterating through the seeding steps

        # selection of parameters for function calculation - they are stored in the lists
        p = [
            damp1[j], dcen1[j], dsigma1[j],
            damp2[j], dcen2[j], dsigma2[j],
            damp3[j], dcen3[j], dsigma3[j],
            doffset[j]
        ] 

        gaus_res_ij = _3gaussian(newenergy_grid[i], *p) # for the i-th energy level and for j-th realization of a curve

        list_for_one_E.append(gaus_res_ij)
        """
        #dataframes for future analysis
        # too slow solution... how to speed up it or use numpy?

        temp = pd.DataFrame({
        "E": newenergy_grid[i],
        "step_number": j,
        "res": res_for_one_E
        }, index=[j])
        results_df = pd.concat([results_df, temp])
        """
        
    all_res.append(list_for_one_E) #appending the calculated list to the 'list of lists'

UPDATE - providing some data for understanding.

newenergy_grid [135.11618653 135.12066818 135.12514984 ... 179.91929833 179.92377998 179.92826164] - about 10000 points representing energy coordinates for the function under test. The number of points depends on the observed data; in my case it ranges from 100 to 100000 points on the energy axis.

In the middle of the code I build p - the list of parameters for the function _3gaussian(newenergy_grid[i], *p) - which calculates the value I need to save for the point with coordinates (newenergy_grid[i], j). So it is a function f(x, y*) evaluated at x = newenergy_grid[i], where y* stands for a set of parameters; each element of the set depends on the iteration number j and is already calculated. On each step (i, j) I only select the parameters y* using j and calculate the value f(x[i], y*[j]).

The list p is constructed at each step j from the pregenerated lists (damp1[j], dcen1[j], dsigma1[j], damp2[j], dcen2[j], dsigma2[j], damp3[j], dcen3[j], dsigma3[j], doffset[j]).

j is the index of a point in the parameter lists; it runs from 0 to len(doffset). Each of the parameter lists has about 10000 elements.

  • damp1 [19.85128939 19.32065201 ... 19.50304656]
  • dsigma1 [0.07900404 0.0798217 ... 0.08074941]
  • ...

CodePudding user response:

Putting some random data together according to your description, and running the code you provided:

from random import random


def _3gaussian(x, *y):
    return x * sum(y)  # some arbitrary value, not relevant to the question


n = 10  # size is arbitrary, scale up to see performance impacts
m = 20
newenergy_grid = [random() * 200 for _ in range(m)]
damp1 = [random() * 20 for _ in range(n)]
damp2 = [random() * 20 for _ in range(n)]
damp3 = [random() * 20 for _ in range(n)]
dsigma1 = [random() * .1 for _ in range(n)]
dsigma2 = [random() * .1 for _ in range(n)]
dsigma3 = [random() * .1 for _ in range(n)]
dcen1 = [random() for _ in range(n)]
dcen2 = [random() for _ in range(n)]
dcen3 = [random() for _ in range(n)]
doffset = [random() for _ in range(n)]

all_res = []

for i in range(0, len(newenergy_grid)):
    list_for_one_E = []

    for j in range(0, len(doffset)):
        p = [
            damp1[j], dcen1[j], dsigma1[j],
            damp2[j], dcen2[j], dsigma2[j],
            damp3[j], dcen3[j], dsigma3[j],
            doffset[j]
        ]

        gaus_res_ij = _3gaussian(newenergy_grid[i], *p)

        list_for_one_E.append(gaus_res_ij)

    all_res.append(list_for_one_E)

print(all_res)

This results in an m x n list of lists, or 2D matrix. So, it's certainly suited to be stored in a DataFrame, although it's not going to be much faster without looking at how you construct the numbered variables before you run this section of code.
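For completeness, once all_res is built, wrapping it into a labelled DataFrame is a one-liner. A small self-contained sketch (the _3gaussian here is a toy stand-in, and the data is made up):

```python
import pandas as pd

# Toy stand-in; the real _3gaussian and parameter lists come from the question.
def _3gaussian(x, *y):
    return x * sum(y)

newenergy_grid = [1.0, 2.0, 3.0]
params = [[0.1, 0.2], [0.3, 0.4]]  # one inner list per seeding step j

all_res = [[_3gaussian(e, *p) for p in params] for e in newenergy_grid]

# Wrap the finished list of lists once, labelling rows with the energy values
df = pd.DataFrame(all_res, index=newenergy_grid)
df.index.name = "E"
print(df.shape)  # (3, 2): one row per energy, one column per iteration
```

The point is that the DataFrame is constructed a single time at the end, rather than grown row by row with concat inside the loop.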

The naming and structure of damp1, damp2, .., dsigma1, dsigma2, .. suggest there's some substantial optimisation possible before you get to this code.

However, starting with those named lists, this would be an improvement, for brevity if anything:

from pandas import DataFrame, concat

df = DataFrame([
    damp1, damp2, damp3, dsigma1, dsigma2, dsigma3, dcen1, dcen2, dcen3, doffset
])
result_df = concat([df.apply(lambda row: _3gaussian(e, *row)) 
                    for e in newenergy_grid], axis=1).T

from numpy import isclose
print(isclose(result_df, DataFrame(all_res)))

Note: I'm using isclose here because, due to floating-point round-off, the results won't necessarily be bit-identical: the two code paths accumulate the intermediate sums in a different order.

As for performance, you'll likely find that using DataFrames here is no faster, in fact probably an order of magnitude or more slower, since you're not leveraging the data types available in numpy and pandas at all.

If you load your data in suitable types from the beginning and vectorise your function, you may be able to improve substantially on speed - but that depends not on the code you shared here, but on all the other code and exceeds the scope of a single question.
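As an illustration of what "vectorise" could mean here: the question never shows the body of _3gaussian, but if it is the usual sum of three Gaussians plus a constant offset (an assumption - the signature and form below are mine), numpy broadcasting can evaluate every (energy, step) pair in a single call, with no Python loops at all:

```python
import numpy as np

def _3gaussian_vec(x, a1, c1, s1, a2, c2, s2, a3, c3, s3, off):
    # x: (m,) array of energies; each parameter: scalar or (n,) array,
    # one entry per seeding step.
    # Broadcasting (m, 1) against (n,) yields the whole (m, n) grid at once.
    x = np.asarray(x)[:, None]
    g = lambda a, c, s: a * np.exp(-(x - c) ** 2 / (2 * s ** 2))
    return g(a1, c1, s1) + g(a2, c2, s2) + g(a3, c3, s3) + off

# Tiny demo: 2 energies, 3 seeding steps, made-up parameter arrays
x = np.array([1.0, 2.0])
ones = np.ones(3)
res = _3gaussian_vec(x, ones, ones, ones, ones, ones, ones, ones, ones, ones,
                     np.zeros(3))
print(res.shape)  # (2, 3)
```

With arrays of this shape, the double loop from the question collapses into one expression, and the result can be handed to pandas directly.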

You could try submitting a more comprehensive example on Code Review and ask for help there.

CodePudding user response:

I have found a solution to my problem. As I am a newbie in Python, I have to make a lot of mistakes before I code better (probably :). I now see the answer was easy and clear from the very first line.

The key was to "vectorize" the function _3gaussian(...) and apply it to the entire energy grid at once, with minimal use of loops and iteration.

Inside this function I use numpy with ndarrays as inputs, so there are no performance problems. With this solution the resulting dataframe is calculated more than a hundred times faster.

Here is the code that does the same thing as the code in the question.

import numpy as np
import pandas as pd

rand_iter_num = N_seed  # number of seeding steps, defined earlier
e_len = len(newenergy_grid)  # count_nonzero would miscount if an energy equals 0

result_arr = np.empty((rand_iter_num, e_len), dtype=np.float64)

print(rand_iter_num)
print(e_len)

for j in range(0, rand_iter_num):

    p = [
            damp1[j], dcen1[j], dsigma1[j],
            damp2[j], dcen2[j], dsigma2[j],
            damp3[j], dcen3[j], dsigma3[j],
            doffset[j]
    ] 

    gaus_res = _3gaussian(newenergy_grid, *p)
    result_arr[j] = gaus_res
df_res = pd.DataFrame(result_arr, columns=newenergy_grid)