I have a data set that I am loading onto a Pandas dataframe that is a Jagged 3-D array called: Waveform. The dataframe is multi-indexed by three levels: Events (Entry), Photons (Subentry) generated by each event, data points (subsubentry) per photon.
The number of data points and Photons varies randomly per each event, hence why it's jagged. I want to extract the Photons (subentries) for each Event (entry) that contain at minimum an "2*n" number of data points, take the average of the first n data points of the selected photons and save them in a new dataframe that contains their respective event and photon index.
I can't put my actual data here because it's too large and jagged so I will create an example that is the same, just scaled down in terms of data.
import awkward as ak
import pandas as pd
#I'm gonna generate an example awkward array that I want to convert to a Pandas DF
wf = ak.to_pandas(ak.Array([ [[1,2,5,6,8,3,21,3],[5986.472,0,6,1,2,3],[0]],[[1]],[[0.1,23,534,21,53,12],[0]],[[1],[2],[0],[12,12,12,12,125,34]],[[76],[23,23,43],],[[0],[12,12,12,12]] ]))
print(wf)
values
entry subentry subsubentry
0 0 0 1.000
1 2.000
2 5.000
3 6.000
4 8.000
5 3.000
6 21.000
7 3.000
1 0 5986.472
1 0.000
2 6.000
3 1.000
4 2.000
5 3.000
2 0 0.000
1 0 0 1.000
2 0 0 0.100
1 23.000
2 534.000
3 21.000
4 53.000
5 12.000
1 0 0.000
3 0 0 1.000
1 0 2.000
2 0 0.000
3 0 12.000
1 12.000
2 12.000
3 12.000
4 125.000
5 34.000
4 0 0 76.000
1 0 23.000
1 23.000
2 43.000
5 0 0 0.000
1 0 12.000
1 12.000
2 12.000
3 12.000
#This is what I want the filter/Extraction to produce
wf_pF = ak.to_pandas(ak.Array([[[1,2,5,6,8,3,21,3],[5986.472,0,6,1,2,3,5]],[[0.1,23,534,21,53,12]],[[12,12,12,12,125,34]] ]))
print(wf_pF)
values
entry subentry subsubentry
0 0 0 1.000
1 2.000
2 5.000
3 6.000
4 8.000
5 3.000
6 21.000
7 3.000
1 0 5986.472
1 0.000
2 6.000
3 1.000
4 2.000
5 3.000
6 5.000
1 0 0 0.100
1 23.000
2 534.000
3 21.000
4 53.000
5 12.000
2 0 0 12.000
1 12.000
2 12.000
3 12.000
4 125.000
5 34.000
#I then want to take the average of the first n datapoints and place them into a new dataframe as such
averages = ak.to_pandas(ak.Array([[2.666,1997.333],[185.7],[12]]))
print(averages)
values
entry subentry
0 0 2.666
1 1997.333
1 0 185.700
2 0 12.000
I used query to look for the 2n -1 datapoint first (in this case I used n = 3 so "5") in the level subsubentry Wf_n = wf.query('subsubentry == 5')
. I took the index of this new dataframe Wf_n and converted the indices of the Entry and Subentry into their respective NumPy arrays
nQuery = wf.query('subsubentry == 5')
indices = nQuery.index.to_frame()["entry"]
indices2 = nQuery.index.to_frame()["subentry"]
ind = pd.Series.to_numpy(indices)
ind2 = pd.Series.to_numpy(indices2)
Then I used query to extract the Entries with their respective subentries with the following:
wf_AF = wf.query("entry in @ind and subentry in @ind2")
print(wf_AF)
which results in this dataframe wf_AF
values
entry subentry subsubentry
0 0 0 1.000
1 2.000
2 5.000
3 6.000
4 8.000
5 3.000
6 21.000
7 3.000
1 0 5986.472
1 0.000
2 6.000
3 1.000
4 2.000
5 3.000
2 0 0 0.100
1 23.000
2 534.000
3 21.000
4 53.000
5 12.000
1 0 0.000
3 0 0 1.000
1 0 2.000
3 0 12.000
1 12.000
2 12.000
3 12.000
4 125.000
5 34.000
It's still keeping subentries (Photons) that contain less than the desired subsubentries (datapoints) number threshold, 2*n. What am I doing wrong? Is there something I am not understanding? What can I do to achieve this specific method of filtering and can it be implemented in CuDF? Because there's so much data it would be ideal if I could replicate this as well in CuDF.
CodePudding user response:
UPDATE: I deserve a dang medal...
Reading through the documentation of awkward
, particularly check out help(ak.Array.__getitem__)
as it's very informative. Also see help(ak.num)
, help(ak.mean)
etc.
These use awkwards
' own methods to filter, slice, and perform calculations on the array... which I'm sure is significantly faster than pandas. Pandas are only used here for visualization.
array = ak.Array([ [[1,2,5,6,8,3,21,3],[5986.472,0,6,1,2,3],[0]],[[1]],[[0.1,23,534,21,53,12],[0]],[[1],[2],[0],[12,12,12,12,125,34]],[[76],[23,23,43],],[[0],[12,12,12,12]] ])
filtered = array[ak.num(array, -1)>5]
df = ak.to_pandas(filtered)
print(df)
Output:
values
entry subentry subsubentry
0 0 0 1.000
1 2.000
2 5.000
3 6.000
4 8.000
5 3.000
6 21.000
7 3.000
1 0 5986.472
1 0.000
2 6.000
3 1.000
4 2.000
5 3.000
2 0 0 0.100
1 23.000
2 534.000
3 21.000
4 53.000
5 12.000
3 0 0 12.000
1 12.000
2 12.000
3 12.000
4 125.000
5 34.000
Say we only want the first three values from each of those...
first_three = filtered[:,:,:3]
df2 = ak.to_pandas(first_three)
print(df2)
Output:
values
entry subentry subsubentry
0 0 0 1.000
1 2.000
2 5.000
1 0 5986.472
1 0.000
2 6.000
2 0 0 0.100
1 23.000
2 534.000
3 0 0 12.000
1 12.000
2 12.000
Now say we want the average of those...
averages = ak.mean(first_three, axis=2)
df3 = ak.to_pandas(averages)
print(df3)
Output:
values
entry subentry
0 0 2.666667
1 1997.490667
2 0 185.700000
3 0 12.000000
All in one step:
ak.to_pandas(ak.mean(array[ak.num(array, -1)>5,:3], axis=2))
Pandas Only:
n = 3
wf_AF = wf.reset_index(-1).groupby(level=[-2, -1]).filter(lambda x: x.subsubentry.max() >= 2*n-1).set_index('subsubentry', append=True)
# OR
wf_AF = wf.groupby(level=[-3, -2]).filter(lambda x: x.index.to_series().str[2].max() >= 2*n-1)
print(wf_AF)
Output:
values
entry subentry subsubentry
0 0 0 1.000
1 2.000
2 5.000
3 6.000
4 8.000
5 3.000
6 21.000
7 3.000
1 0 5986.472
1 0.000
2 6.000
3 1.000
4 2.000
5 3.000
2 0 0 0.100
1 23.000
2 534.000
3 21.000
4 53.000
5 12.000
3 3 0 12.000
1 12.000
2 12.000
3 12.000
4 125.000
5 34.000
averages = wf_AF.groupby(level=[-3, -2]).agg(lambda x: x.head(n).mean())
print(averages)
Output:
values
entry subentry
0 0 2.666667
1 1997.490667
2 0 185.700000
3 3 12.000000