Is there a more efficient way of averaging across arrays whilst also filtering out invalid values?


I am trying to find the average values of arrays in a dataset. I saw some people use np.mean() to quickly find the averages across many arrays. However, this does not work for me because the datasets contain invalid values, which are represented by the default value of 0.
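For instance, with a small made-up array (not my real data), an invalid 0 gets counted as a real measurement and drags the mean down:

import numpy as np

# toy array: 0 marks an invalid value
lst = np.array([[300.0,   0.0],
                [310.0, 320.0]])

print(np.mean(lst))  # 232.5, but the mean of the three valid values is 310.0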

My solution is to loop through all the values in each array, adding each valid value to a sum array and incrementing the corresponding entry of a denominator array by 1. My code is shown below.

from pyhdf.SD import SD, SDC
import numpy as np
import os

a, b = np.shape(lst_template)
sum_arr = np.zeros((a,b))
denominator_arr = np.zeros((a, b))

for filename in os.listdir(file_location):
    try:
        file = SD(file_location + '/' + filename)
        lst = file.select('LST').get()
        if np.shape(lst) == (a, b):
            for i in range(a):
                for j in range(b):
                    if lst[i][j] != 0:
                        sum_arr[i][j] += lst[i][j]
                        denominator_arr[i][j] += 1
            print(filename)
        else:
            print('shape incorrect ' + filename)
            
    except Exception as e:
        print(e)
        print(filename)

result_arr = np.zeros((a, b))

for i in range(a):
    for j in range(b):
        if denominator_arr[i][j] != 0:
            result_arr[i][j] = sum_arr[i][j]/denominator_arr[i][j]

However, this method is very slow. I need to loop through around 800 datasets, and when testing with just 80 it was already taking a very long time. Are there any more efficient methods of averaging these values?

CodePudding user response:

First collect all the datasets into a list. Then use np.asarray to convert the list into a single array, with the 0th dimension distinguishing the datasets.

np.sum and np.count_nonzero both accept an axis argument over which to operate, which lets them do the same work as your nested loops. These functions are much faster because the looping happens in C (assuming CPython). The average can then be calculated as np.sum(datasets_arr, 0) / np.count_nonzero(datasets_arr, 0). Final code:

from pyhdf.SD import SD, SDC
import numpy as np
import os

a, b = np.shape(lst_template)

datasets = []
for filename in os.listdir(file_location):
    try:
        file = SD(file_location + '/' + filename)
        lst = file.select('LST').get()
        if np.shape(lst) == (a, b):
            datasets.append(lst)
        else:
            print('shape incorrect ' + filename)
            
    except Exception as e:
        print(e)
        print(filename)

datasets_arr = np.asarray(datasets)
denominator = np.count_nonzero(datasets_arr, 0)
# avoid division by zero where a pixel is invalid (0) in every dataset
denominator[denominator == 0] = 1
result_arr = np.sum(datasets_arr, 0) / denominator
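
As a quick sanity check (toy numbers, not real LST data), the axis-0 expression reproduces the per-pixel average of only the nonzero values:

import numpy as np

# two tiny 2x2 "datasets"; 0 marks an invalid pixel
toy = np.asarray([
    [[300.0,   0.0],
     [310.0, 320.0]],
    [[302.0, 316.0],
     [  0.0, 318.0]],
])

denom = np.count_nonzero(toy, 0)  # [[2, 1], [1, 2]]
denom[denom == 0] = 1
print(np.sum(toy, 0) / denom)     # [[301. 316.]
                                  #  [310. 319.]]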

Another thing to try after doing this is to time how long it takes to get all the datasets. I'm not familiar with pyhdf, but it's possible that reading the files is your bottleneck rather than the averaging.
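
A minimal sketch of how you might time the two phases separately (read_datasets and average here are just placeholders for the two parts of the code above):

import time

def read_datasets():
    # placeholder for the loop that reads the HDF files into `datasets`
    ...

def average(datasets):
    # placeholder for the np.asarray / np.count_nonzero / np.sum step
    ...

t0 = time.perf_counter()
datasets = read_datasets()
t1 = time.perf_counter()
result_arr = average(datasets)
t2 = time.perf_counter()

print(f'reading files: {t1 - t0:.2f} s, averaging: {t2 - t1:.2f} s')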
