filtering "events" in awkward-array

Time:08-13

I am reading data from a file of "events". Each event has some number of "tracks", and each track has a series of "variables". A stripped-down version of the code (using awkward0 as awkward) looks like this:

import h5py
import numpy as np
import awkward0 as awkward

# dtype_X is defined elsewhere in the full code
f = h5py.File('dataAA/pv_HLT1CPU_MinBiasMagDown_14Nov.h5',mode="r")

afile = awkward.hdf5(f)

pocaz  = np.asarray(afile["poca_z"].astype(dtype_X))

pocaMx = np.asarray(afile["major_axis_x"].astype(dtype_X))
pocaMy = np.asarray(afile["major_axis_y"].astype(dtype_X))
pocaMz = np.asarray(afile["major_axis_z"].astype(dtype_X))

In this snippet of code, "pocaz", "pocaMx", etc. are what I have called variables (a physics label, not a Python data type). On rare occasions, pocaz takes on an extreme value, pocaMx and/or pocaMy take on nan values, and/or pocaMz takes on the value inf. I would like to remove these tracks from the events using some syntactically simple method. I am guessing this functionality exists (perhaps in the current version of awkward but not awkward0), but cannot find it described in a transparent way. Is there a simple example anywhere?

Thanks, Mike

CodePudding user response:

It looks to me, from the fact that you're able to call np.asarray on these arrays without error, that they are one-dimensional arrays of numbers. If so, then Awkward Array isn't doing anything for you here; you should be able to find the one-dimensional NumPy arrays inside

f["poca_z"], f["major_axis_x"], f["major_axis_y"], f["major_axis_z"]

as groups (note that this is f, not afile) and leave Awkward Array entirely out of it.

The reason I say that is because you can use np.isfinite on these NumPy arrays. (There's an equivalent in Awkward v1, v2, but you're talking about Awkward v0 and I don't remember.) That will give you an array of booleans for you to slice these arrays.

I don't have the HDF5 file for testing, but I think it would go like this:

f = h5py.File('dataAA/pv_HLT1CPU_MinBiasMagDown_14Nov.h5',mode="r")

# read from the h5py file object f (not afile)
pocaz = np.asarray(f["poca_z"]["0"], dtype=dtype_X)

pocaMx = np.asarray(f["major_axis_x"]["0"], dtype=dtype_X)   # the only array
pocaMy = np.asarray(f["major_axis_y"]["0"], dtype=dtype_X)   # in each group
pocaMz = np.asarray(f["major_axis_z"]["0"], dtype=dtype_X)   # is named "0"

good = np.ones(len(pocaz), dtype=bool)
good &= np.isfinite(pocaz)
good &= np.isfinite(pocaMx)
good &= np.isfinite(pocaMy)
good &= np.isfinite(pocaMz)

pocaz[good], pocaMx[good], pocaMy[good], pocaMz[good]

If you also need to cut extreme finite values, you can include

good &= (-1000 < pocaz) & (pocaz < 1000)

etc. in the good selection criteria.
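For anyone who wants to try the masking logic without the HDF5 file, here is a minimal sketch on made-up data. The helper name good_track_mask, the z_limit parameter, and the toy values are all hypothetical; the 1000 cut is only the illustrative threshold used above, not a recommended physics cut.

```python
import numpy as np

def good_track_mask(pocaz, pocaMx, pocaMy, pocaMz, z_limit=1000.0):
    """Return a boolean mask that is True for tracks worth keeping.

    Hypothetical helper: z_limit is an illustrative threshold only.
    """
    # Reject nan and +/-inf in any of the four variables.
    good = (
        np.isfinite(pocaz)
        & np.isfinite(pocaMx)
        & np.isfinite(pocaMy)
        & np.isfinite(pocaMz)
    )
    # Reject extreme but finite poca_z values.
    good &= np.abs(pocaz) < z_limit
    return good

# Toy data standing in for the arrays read from the file (made up).
pocaz = np.array([1.0, 5000.0, -2.0, 3.0, 4.0])
pocaMx = np.array([0.1, 0.2, np.nan, 0.3, 0.4])
pocaMy = np.array([0.1, 0.2, 0.3, 0.3, 0.4])
pocaMz = np.array([1.0, 1.0, 1.0, np.inf, 2.0])

good = good_track_mask(pocaz, pocaMx, pocaMy, pocaMz)
print(good)         # which tracks survive
print(pocaz[good])  # the same mask slices every variable consistently
```

Building one mask and applying it to every array keeps the variables aligned, so index i still refers to the same track in all of them after the cut.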

(The way you'd do this in Awkward Array is not any different, since Awkward is just generalizing what NumPy does here, but if you don't need it, you might as well leave it out.)

CodePudding user response:

If you want numpy arrays, why not read the data with h5py functions? It provides a very natural way to return the datasets as arrays. Code would look like this. (FYI, I used the file context manager to open the file.)

with h5py.File('dataAA/pv_HLT1CPU_MinBiasMagDown_14Nov.h5',mode="r") as h5f:
    # the [()] returns the dataset as an array:
    pocaz_arr = h5f["poca_z"]["0"][()]
    # verify array shape and datatype:
    print(f"Shape: {pocaz_arr.shape},  Dtype: {pocaz_arr.dtype}")
    pocaMx_arr = h5f["major_axis_x"]["0"][()]  # the only dataset
    pocaMy_arr = h5f["major_axis_y"]["0"][()]  # in each group
    pocaMz_arr = h5f["major_axis_z"]["0"][()]  # is named "0"