execute iterative queries over a pandas dataframe-CodePudding

I have a CSV file which looks like:

Detection,Imagename,Frame_Identifier,TL_x,TL_y,BR_x,BR_y,detection_Confidence,Target_Length,Species,Confidence
0,201503.20150619.181140817.204628.jpg,0,272,142.375,382.5,340,0.475837,0,fish,0.475837
1,201503.20150619.181141498.204632.jpg,3,267.75,6.375,422.875,80.75,0.189145,0,fish,0.189145
2,201503.20150619.181141662.204633.jpg,4,820.25,78.625,973.25,382.5,0.615788,0,fish,0.615788
3,201503.20150619.181141662.204633.jpg,4,1257,75,1280,116,0.307278,0,fish,0.307278
4,201503.20150619.181141834.204634.jpg,5,194,281,233,336,0.586944,0,fish,0.586944

I load it as pandas.Dataframe named: imageannotation - I am interested in extracting a dictionary which has as key the imagename (note: Imagename can have duplicate rows), and as value, an other dictionary whit 2 keys: ['bbox',, 'species'], where bbox is a list given by the TL_x, TL_y, BR_x, BR_y values

I can accomplish this with the following code:


test = {
    i: {
        "bbox": imageannotation[imageannotation["Imagename"] == i][
            ["TL_x", "TL_y", "BR_x", "BR_y"]
        ].values,
        "species": imageannotation[imageannotation["Imagename"] == i][
            ["Species"]
        ].values,
    }
    for i in imageannotation["Imagename"].unique()
}

The results looks like this:

mydict = {'201503.20150619.181140817.204628': {'bbox': array([[272.   , 142.375, 382.5  , 340.   ]]),
  'species': array([['fish']], dtype=object)},
 '201503.20150619.181141498.204632': {'bbox': array([[267.75 ,   6.375, 422.875,  80.75 ]]),
  'species': array([['fish']], dtype=object)},
 '201503.20150619.181141662.204633': {'bbox': array([[ 820.25 ,   78.625,  973.25 ,  382.5  ],
         [1257.   ,   75.   , 1280.   ,  116.   ]]),
  'species': array([['fish'],
         ['fish']], dtype=object)},
 '201503.20150619.181141834.204634': {'bbox': array([[194., 281., 233., 336.],
         [766., 271., 789., 293.]]),
  'species': array([['fish'],
         ['fish']], dtype=object)}}

which is what I wanted but can get extremely slow when working on large files.

Q: Do you have any better way to accomplish this?

My final target is to add a new column to a dataframe imagemetadata which is bigger than the has an Imagename field with unique values - and I do this last operation with:

for i in mydict:
    imagemetadata.loc[imagemetadata.Imagename == i, "annotation"] = [test[I]]

CodePudding user response：

(Revised answer now that I re-read things.)

This seems to be what you might be after; group the annotations by Imagename, make a dict-of-lists out of them, map them into the other dataframe.

import io

import pandas as pd

imageannotation = pd.read_csv(
    io.StringIO(
        """
Detection,Imagename,Frame_Identifier,TL_x,TL_y,BR_x,BR_y,detection_Confidence,Target_Length,Species,Confidence
0,201503.20150619.181140817.204628.jpg,0,272,142.375,382.5,340,0.475837,0,fish,0.475837
1,201503.20150619.181141498.204632.jpg,3,267.75,6.375,422.875,80.75,0.189145,0,fish,0.189145
2,201503.20150619.181141662.204633.jpg,4,820.25,78.625,973.25,382.5,0.615788,0,fish,0.615788
3,201503.20150619.181141662.204633.jpg,4,1257,75,1280,116,0.307278,0,fish,0.307278
4,201503.20150619.181141834.204634.jpg,5,194,281,233,336,0.586944,0,fish,0.586944
"""
    )
)

# (Pretend this comes from a separate file)
imagemetadata = pd.DataFrame({"Imagename": imageannotation.Imagename.unique()})


def make_annotation(r):
    return {
        "bbox": [r.TL_x, r.TL_y, r.BR_x, r.BR_y],
        "species": r.Species,
    }


annotations_by_image = (
    imageannotation.groupby("Imagename")
    .apply(lambda r: r.apply(make_annotation, axis=1).to_list())
    .to_dict()
)
imagemetadata = pd.DataFrame({"Imagename": imageannotation.Imagename.unique()})
imagemetadata["annotation"] = imagemetadata.Imagename.map(annotations_by_image)

print(imagemetadata)

The output is

                              Imagename                                         annotation
0  201503.20150619.181140817.204628.jpg  [{'bbox': [272.0, 142.375, 382.5, 340.0], 'spe...
1  201503.20150619.181141498.204632.jpg  [{'bbox': [267.75, 6.375, 422.875, 80.75], 'sp...
2  201503.20150619.181141662.204633.jpg  [{'bbox': [820.25, 78.625, 973.25, 382.5], 'sp...
3  201503.20150619.181141834.204634.jpg  [{'bbox': [194.0, 281.0, 233.0, 336.0], 'speci...

If you want imagemetadata to have multiple lines if annotation has multiple entries,

imagemetadata = imagemetadata.explode("annotation").reset_index(drop=True)

Revised, again

For a dict-of-lists instead of a list-of-dicts, it's even simpler:

# Generate a bbox column
imageannotation["bbox"] = imageannotation.apply(lambda x: [x.TL_x, x.TL_y, x.BR_x, x.BR_y], axis=1)

# Get the columns we want as a dict
annotations_by_image = imageannotation.groupby("Imagename").agg({"bbox": list, "Species": list}).to_dict("index")

# Apply the annotations to the other df
imagemetadata["annotation"] = imagemetadata.Imagename.map(annotations_by_image)
print(imagemetadata)

The output is

                              Imagename                                         annotation
0  201503.20150619.181140817.204628.jpg  {'bbox': [[272.0, 142.375, 382.5, 340.0]], 'Sp...
1  201503.20150619.181141498.204632.jpg  {'bbox': [[267.75, 6.375, 422.875, 80.75]], 'S...
2  201503.20150619.181141662.204633.jpg  {'bbox': [[820.25, 78.625, 973.25, 382.5], [12...
3  201503.20150619.181141834.204634.jpg  {'bbox': [[194.0, 281.0, 233.0, 336.0]], 'Spec...