How can I access and manage iterables inside each pandas.DataFrame column?

I have the following JSON file:

{
  "IMG1.tif": {
    "0": [
      100,
      192,
      [
        129,
        42,
        32
      ]
    ],
    "1": [
      299,
      208,
      [
        133,
        42,
        24
      ]
    ]
  },
  "IMG2.tif": {
    "0": [
      100,
      207,
      [
        128,
        41,
        34
      ]
    ],
    "1": [
      299,
      192,
      [
        81,
        25,
        26
      ]
    ]
  }
}

I'm reading it into a DataFrame with df = pd.read_json('img_data.json', orient='columns'). I find this a clear and logical way to store the information, but I want to access the individual values inside each column and be able to iterate over and work with them.

For example, in this case, these values are coordinates. I'd like to, in the most convenient and natural way possible, be able to access the x, y or z axis value(s) for every coordinate in each column, i.e. (something like):

>>> df["IMG1.tif"][0,:]
0    100
1    299

or even filter across the whole dataframe:

>>> get_y_values(df)
   IMG1.tif   IMG2.tif
0    192        207
1    208        192

I'm also open to suggestions on changing how the data is stored (it may be necessary), but I don't think I can store the values outside lists because of the way they're obtained. As you can see in

"IMG1.tif": { "0": [100, 192, [129, 42, 32]] ...

each set of 3 coordinates in the dataframe sits inside a list.

In case you are curious or confused, the z axis values are just RGB values. At some point I will also need to convert them to grayscale inside the dataframe:

>>> do_grayscale(df) # example values
        IMG1.tif          IMG2.tif
0    [100, 192, 61]    [100, 207, 87]
1    [299, 208, 122]   [299, 192, 94]

Added: one of the alternative ways to have the original data stored, albeit with sacrifices in the original code, would be something like this:

      x      y           z         image_name
0    100    192    [129, 42, 32]    IMG1.tif
1    299    208    [133, 42, 24]    IMG1.tif
2    100    207    [128, 41, 34]    IMG2.tif
3    299    192    [81, 25, 26]     IMG2.tif

Answer:

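I'd suggest building a dataframe with MultiIndex columns:

import pandas as pd

df = df.T  # transpose first: images become rows, point ids (0, 1) become columns

# Expand each point's [x, y, z] list into three sub-columns under that point id
df_out = pd.concat([
    pd.DataFrame(df[col].tolist(), index=df.index,
                 columns=pd.MultiIndex.from_tuples([(col, a) for a in ("x", "y", "z")]))
    for col in df.columns
], axis=1)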

This will give you the following df:

            0                        1                    
            x    y              z    x    y              z
IMG1.tif  100  192  [129, 42, 32]  299  208  [133, 42, 24]
IMG2.tif  100  207  [128, 41, 34]  299  192   [81, 25, 26]

You can then access any element of your frame with the .loc indexer. For instance:

df_out.loc['IMG1.tif', (0, "y")]  # returns 192
df_out.loc['IMG1.tif', ([0, 1], "x")]  # returns a Series with 100 and 299
df_out.loc[:, ([0, 1], "y")]  # all y values, assuming the only point ids are 0 and 1 (adjust the list otherwise)
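
If you want something close to the get_y_values(df) output from the question, one option is to take a cross-section of the second column level with xs and transpose the result. This is only a sketch: get_axis_values is a name I made up, and the data is built inline from a dict rather than read from img_data.json, so the "0"/"1" keys are cast to int by hand to mimic what read_json does.

```python
import pandas as pd

# Inline copy of the JSON from the question (stand-in for img_data.json)
data = {
    "IMG1.tif": {"0": [100, 192, [129, 42, 32]], "1": [299, 208, [133, 42, 24]]},
    "IMG2.tif": {"0": [100, 207, [128, 41, 34]], "1": [299, 192, [81, 25, 26]]},
}
df = pd.DataFrame(data)
df.index = df.index.astype(int)  # read_json would already give integer labels
df = df.T  # images as rows, point ids as columns

# Same MultiIndex construction as above
df_out = pd.concat([
    pd.DataFrame(df[col].tolist(), index=df.index,
                 columns=pd.MultiIndex.from_tuples([(col, a) for a in ("x", "y", "z")]))
    for col in df.columns
], axis=1)


def get_axis_values(frame, axis):
    """Hypothetical helper: one axis ("x", "y" or "z") for every point and image."""
    # xs selects the given label on the second column level and drops that level;
    # transposing puts point ids on the rows and image names on the columns.
    return frame.xs(axis, axis=1, level=1).T


y_values = get_axis_values(df_out, "y")
print(y_values)
```

The advantage over listing [0, 1] by hand in .loc is that xs picks up however many point ids the frame happens to have.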

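Edit: if 0 and 1 are not relevant as an index and you want the structure of your last example, start again from the original (untransposed) df, so that the stacked level holds the image names: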

df = df.stack().reset_index(level=1)

df_out = pd.concat([
    pd.DataFrame(sub_df[0].tolist(), columns=["x", "y", "z"]).assign(image_name=img)
    for img, sub_df in df.groupby('level_1')
]).reset_index(drop=True)

Output:

     x    y              z image_name
0  100  192  [129, 42, 32]   IMG1.tif
1  299  208  [133, 42, 24]   IMG1.tif
2  100  207  [128, 41, 34]   IMG2.tif
3  299  192   [81, 25, 26]   IMG2.tif
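
With this long layout, the grayscale step and the per-image y table from the question become ordinary column operations. The sketch below rebuilds the frame above inline so it runs on its own. The luma weights (0.299, 0.587, 0.114) are my assumption, since the question doesn't say which grayscale formula is wanted, and the gray, point, long_df and y_wide names are just illustrative.

```python
import pandas as pd

# The long-format frame from the output above, rebuilt inline
long_df = pd.DataFrame({
    "x": [100, 299, 100, 299],
    "y": [192, 208, 207, 192],
    "z": [[129, 42, 32], [133, 42, 24], [128, 41, 34], [81, 25, 26]],
    "image_name": ["IMG1.tif", "IMG1.tif", "IMG2.tif", "IMG2.tif"],
})

# Split the RGB lists into real columns so the arithmetic is vectorised
rgb = pd.DataFrame(long_df["z"].tolist(), columns=["r", "g", "b"], index=long_df.index)

# Assumed formula: weighted sum of the channels, rounded to the nearest integer
long_df["gray"] = (0.299 * rgb["r"] + 0.587 * rgb["g"] + 0.114 * rgb["b"]).round().astype(int)

# Number the points within each image, then pivot to one column per image
long_df["point"] = long_df.groupby("image_name").cumcount()
y_wide = long_df.pivot(index="point", columns="image_name", values="y")

print(long_df[["x", "y", "gray", "image_name"]])
print(y_wide)
```

If you'd rather use a plain average of the three channels, rgb.mean(axis=1) can replace the weighted sum.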