Calculate mean value for each pixel of a sum of Xarray DataArrays-CodePudding

I am trying to calculate a fog frequency map based on a number of geoTIFFs that I have read as Xarray DataArrays using the rioxarray.open_rasterio function in Python 3.10. Each "pixel" can have one of the following values: 1 = fog, 0 = no fog, -9999 = no data. The end goal is to calculate a new DataArray that contains the ratio of the number "fog" pixels to the number of pixel with either "fog" or "no fog".

For this I want to write a for-loop that creates the sum of "fog" and "no_fog" entries per pixel while excluding the "no data" pixels. Then it should divide the pixel values of the sum DataArray by the number of pixels that were used in the calculation of each individual sum. So, if for a single pixel there are the following values: 0, 1, 1, -9999, 0, and -9999 the loop should create a sum of 2 and divide it by 4, creating a fog frequency of 0.5 or 50%.

So far, I have only been able to calculate the sum of all input DataArrays, without excluding the "no data" pixels using this code:

# open all fog maps and create a list:
folder = "E:/Jasper/Studium/BA_Thesis/MODIS_data/MODIS_2021_data/2021_06/fog_frequency"
list_of_maps = glob.glob(folder   '/fog_map*.tif', recursive=True)  # all files that start with "fog_map"

# make list with all different filenames (dates) in this folder:
maps = []  # initialize empty list for all file names
for i in range(0, np.size(list_of_maps)):
    # files naming convention "fog_map_YYYYMMDD_HHMMSS.tif":
    maps.append(list_of_maps[i].split('fog_map_')[1][0:8])

# find out how many dates are in the folder:
maps = np.unique(maps)  # remove duplicates from array
print(maps)
print('\ndata from {} different dates in this folder\n'.format(np.size(maps)))

# create fog_sum xarray dataArray to have something to start out with and later subtract it again:
fog_sum = rioxarray.open_rasterio("E:/Jasper/Studium/BA_Thesis/MODIS_data/MODIS_2021_data/2021_06/fog_frequency/fog_map_20210601.tif")
fog_sum_subtract = rioxarray.open_rasterio("E:/Jasper/Studium/BA_Thesis/MODIS_data/MODIS_2021_data/2021_06/fog_frequency/fog_map_20210601.tif")

# add all fog maps:
for i in range(0, np.size(list_of_maps)):
    # open data sets:
    fog_map = rioxarray.open_rasterio(list_of_maps[i], engine='rasterio')
    # fog_map = fog_map.where(fog_map >= 0)
    fog_sum = fog_sum   fog_map

# subtract original fog map and export as geoTIFF:
fog_sum = fog_sum - fog_sum_subtract
fog_sum.rio.to_raster("E:/Jasper/Studium/BA_Thesis/MODIS_data/MODIS_2021_data/2021_06/fog_frequency/fog_sum.tif",
                      driver="GTiff")

I tried to exclude the "no data" values using fog_map = fog_map.where(fog_map >= 0), but this left me with a fog_sum geoTIFF, where each pixel had the value 1.79769e 308

This is an example of what the output of a fog_map_YYYYMMDD.tif DataArray looks like, before applying the fog_map.where(fog_map >= 0) function:

<xarray.DataArray (band: 1, y: 412, x: 388)>
[159856 values with dtype=float32]
Coordinates:
  * band         (band) int32 1
  * x            (x) float64 -92.49 -92.49 -92.48 ... -89.02 -89.02 -89.01
  * y            (y) float64 2.0 1.991 1.982 1.973 ... -1.674 -1.683 -1.692
    spatial_ref  int32 0
Attributes:
    STATISTICS_MAXIMUM:        1
    STATISTICS_MEAN:           -858.62379891903
    STATISTICS_MINIMUM:        -9999
    STATISTICS_STDDEV:         2801.4551987932
    STATISTICS_VALID_PERCENT:  100
    scale_factor:              1.0
    add_offset:                0.0

And after applying the function:

<xarray.DataArray (band: 1, y: 412, x: 388)>
array([[[ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0., nan,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        ...,
        [ 0.,  0.,  0., ..., nan, nan, nan],
        [ 0.,  0.,  0., ..., nan, nan, nan],
        [ 0.,  0.,  0., ..., nan, nan, nan]]], dtype=float32)
Coordinates:
  * band         (band) int32 1
  * x            (x) float64 -92.49 -92.49 -92.48 ... -89.02 -89.02 -89.01
  * y            (y) float64 2.0 1.991 1.982 1.973 ... -1.674 -1.683 -1.692
    spatial_ref  int32 0
Attributes:
    STATISTICS_MAXIMUM:        1
    STATISTICS_MEAN:           -858.62379891903
    STATISTICS_MINIMUM:        -9999
    STATISTICS_STDDEV:         2801.4551987932
    STATISTICS_VALID_PERCENT:  100
    scale_factor:              1.0
    add_offset:                0.0

Any help is greatly appreciated!

CodePudding user response：

If your data is this small it’s probably fastest and easiest to concat all the data, drop the -9999 values using the da.where() code you suggest, then just take the mean over the concatenated dimension:

fog_maps = []
for i in range(0, np.size(list_of_maps)):
    # open data sets:
    fog_map = rioxarray.open_rasterio(list_of_maps[i], engine='rasterio')
    fog_maps.append(fog_map)

all_fog = xr.concat(fog_maps, dim="file")
all_fog = all_fog.where(all_fog >= 0)
mean_fog = all_fog.mean(dim="file")

Your data is already float (though maybe it shouldn’t be?) so replacing -9999 with np.nan (a float) isn’t hurting you.

But if you’re facing memory constraints and want to accumulate the stats iteratively I’d use xarray.where(da >= 0, 1, 0) to accumulate your denominator:

# add all fog maps:
data_count = 0
fog_sum = 0
for i in range(0, np.size(list_of_maps)):
    # open data sets:
    fog_map = rioxarray.open_rasterio(list_of_maps[i], engine='rasterio')
    data_count  = xr.where(fog_map >= 0), 1, 0)
    fog_sum  = fog_map.where(fog_map > 0, 0)

fog_mean = (fog_sum / data_count).where(data_count > 0)

xr.where is like xr.DataArray.where but you define the value returned if true explicitly.

As an aside, the value you were seeing, 1.797e308, is the largest value that can be held by a float64. So you’re probably just looking at the GeoTiff encoding of np.inf after a divide by zero problem. Make sure to mask the data where your denominator is zero and handle it the way you’re intending for locations where there is never any valid data.