Home > Software engineering >  Python DataFrame manipulation: How to extract a set of columns in a fast way
Python DataFrame manipulation: How to extract a set of columns in a fast way

Time:05-06

I need to access and extract information from a Dataframe that is used for other colleagues in a research group.

The DataFrame structure is:

zee.loc[zee['layer']=='EMB2'].loc[zee['roi']==0]

            e           et       eta       phi      deta    dphi     samp    hash     det   layer roi   eventNumber
2249    20.677443   20.675829   0.0125  -1.067651   0.025   0.024544    3   2030015444  2   EMB2    0   2
2250    21.635288   21.633598   0.0125  -1.043107   0.025   0.024544    3   2030015445  2   EMB2    0   2
2251    -29.408310  -29.406013  0.0125  -1.018563   0.025   0.024544    3   2030015446  2   EMB2    0   2
2252    43.127533   43.124165   0.0125  -0.994020   0.025   0.024544    3   2030015447  2   EMB2    0   2
2253    -3.025344   -3.025108   0.0125  -0.969476   0.025   0.024544    3   2030015448  2   EMB2    0   2
... ... ... ... ... ... ... ... ... ... ... ... ...
4968988 -5.825550   -5.309279   0.4375  -0.454058   0.025   0.024544    3   2030019821  2   EMB2    0   3955
4968989 39.750645   36.227871   0.4375  -0.429515   0.025   0.024544    3   2030019822  2   EMB2    0   3955
4968990 80.568573   73.428436   0.4375  -0.404971   0.025   0.024544    3   2030019823  2   EMB2    0   3955
4968991 -28.921751  -26.358652  0.4375  -0.380427   0.025   0.024544    3   2030019824  2   EMB2    0   3955
4968992 55.599472   50.672146   0.4375  -0.355884   0.025   0.024544    3   2030019825  2   EMB2    0   3955

So, I need to work only with the layer: EMB2 and the columns: et, eta, phi. To pick up these columns, I'm using the following code:

EtEtaPhi, EventLens  = [], []
events = set(zee.loc[zee['layer']=='EMB2']['eventNumber'].to_numpy())
roi    = set(zee.loc[zee['layer']=='EMB2']['roi'].to_numpy())
for ee in events:
    for rr in roi:
        if len(zee.loc[zee['layer']=='EMB2'].loc[zee['eventNumber']==ee].loc[zee['roi']==rr])==0: break       
        EtEtaPhi.append(zee[['et','eta','phi']].loc[zee['layer']=='EMB2'].loc[zee['eventNumber']==ee].loc[zee['roi']==rr].to_numpy())
        EventLens.append(len(EtEtaPhi[-1]))

But to read 4000 events take so long time, almost one second per event. This result isn't good, almost one hour just to extract those columns!

Is there some way to extract columns from a DataFrame more efficiently and faster?

CodePudding user response:

The code

zee[['et','eta','phi']].loc[zee['layer']=='EMB2']

which you already have somewhere in there should do what you asked for. The rest is not needed.

CodePudding user response:

Just use .loc:

sample = zee.loc[zee["layer"].eq("EMB2"), ["et","eta","phi"]]
  • Related