Home > database >  Assign new value to a cell in pd.DataFrame which is a pd.Series when series index isn't unique
Assign new value to a cell in pd.DataFrame which is a pd.Series when series index isn't unique

Time:08-13

Here is my data if anyone wants to try to reproduce the problem: https://github.com/LunaPrau/personal/blob/main/O_paired.csv

I have a pd.DataFrame (called O) of 1402 rows × 1402 columns with columns and index both as ['XXX-icsd', 'YYY-icsd', ...] and cell values as some np.float64, some np.nan and problematically, some as pandas.core.series.Series.

202324-icsd 644068-icsd 27121-icsd 93847-icsd 154319-icsd
202324-icsd 0.000000 0.029729 NaN 0.098480 0.097867
644068-icsd NaN 0.000000 NaN 0.091311 0.091049
27121-icsd 0.144897 0.137473 0.0 0.081610 0.080442
93847-icsd NaN NaN NaN 0.000000 0.005083
154319-icsd NaN NaN NaN NaN 0.000000

The problem is that some cells (e.g. O.loc["192693-icsd", "192401-icsd"]) contain a pandas.core.series.Series of form:

192693-icsd    0.129562
192693-icsd    0.129562
Name: 192401-icsd, dtype: float64

I'm struggling to make this cell contain only a np.float64. I tried:

O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]

and other various known forms of assignnign a new value to a cell in pd.DataFrame, but they only assign a new element to the same series in this cell, e.g. if I do

O.loc["192693-icsd", "192401-icsd"] = 5

then when calling O.loc["192693-icsd", "192401-icsd"] I get:

192693-icsd    5.0
192693-icsd    5.0
Name: 192401-icsd, dtype: float64

How to modify O.loc["192693-icsd", "192401-icsd"] so that it is of type np.float64?

CodePudding user response:

It's not that df.loc["192693-icsd", "192401-icsd"] contain a Series, your index just isn't unique. This is especially obvious looking at these outputs:

>>> df.loc["192693-icsd"]
             202324-icsd  644068-icsd  27121-icsd  93847-icsd  154319-icsd  28918-icsd  28917-icsd  ...  108768-icsd  194195-icsd  174188-icsd  159632-icsd  89111-icsd  23308-icsd  253341-icsd
192693-icsd          NaN          NaN         NaN         NaN     0.146843         NaN         NaN  ...          NaN     0.271191          NaN          NaN         NaN         NaN     0.253996
192693-icsd          NaN          NaN         NaN         NaN     0.146843         NaN         NaN  ...          NaN     0.271191          NaN          NaN         NaN         NaN     0.253996

[2 rows x 1402 columns]

# And the fact that this returns the same:
>>> df.at["192693-icsd", "192401-icsd"]
192693-icsd    0.129562
192693-icsd    0.129562
Name: 192401-icsd, dtype: float64

You can fix this with a groupby, but you have to decide what to do with the non-unique groups. It looks like they're the same, so we'll combine them with max:

df = df.groupby(level=0).max()

Now it'll work as expected:

>>> df.loc["192693-icsd", "192401-icsd"])
0.129562120551387

Your non-unique values are:

>>> df.index[df.index.duplicated()]
Index(['193303-icsd', '192693-icsd', '416602-icsd'], dtype='object')

CodePudding user response:

IIUC, you can try DataFrame.applymap to check each cell type and get the first row if it is Series

df = df.applymap(lambda x: x.iloc[0] if type(x) == pd.Series else x)

CodePudding user response:

It works as expected for O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]

Check this colab link: https://colab.research.google.com/drive/1XFXuj4OBu8GXQx6DTqv04XellmFcFWbC?usp=sharing

  • Related