Assign new value to a cell in pd.DataFrame which is a pd.Series when series index isn't unique-CodePudding

Here is my data if anyone wants to try to reproduce the problem: https://github.com/LunaPrau/personal/blob/main/O_paired.csv

I have a pd.DataFrame (called O) of 1402 rows × 1402 columns with columns and index both as ['XXX-icsd', 'YYY-icsd', ...] and cell values as some np.float64, some np.nan and problematically, some as pandas.core.series.Series.

	202324-icsd	644068-icsd	27121-icsd	93847-icsd	154319-icsd
202324-icsd	0.000000	0.029729	NaN	0.098480	0.097867
644068-icsd	NaN	0.000000	NaN	0.091311	0.091049
27121-icsd	0.144897	0.137473	0.0	0.081610	0.080442
93847-icsd	NaN	NaN	NaN	0.000000	0.005083
154319-icsd	NaN	NaN	NaN	NaN	0.000000

The problem is that some cells (e.g. O.loc["192693-icsd", "192401-icsd"]) contain a pandas.core.series.Series of form:

192693-icsd    0.129562
192693-icsd    0.129562
Name: 192401-icsd, dtype: float64

I'm struggling to make this cell contain only a np.float64. I tried:

O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]

and other various known forms of assignnign a new value to a cell in pd.DataFrame, but they only assign a new element to the same series in this cell, e.g. if I do

O.loc["192693-icsd", "192401-icsd"] = 5

then when calling O.loc["192693-icsd", "192401-icsd"] I get:

192693-icsd    5.0
192693-icsd    5.0
Name: 192401-icsd, dtype: float64

How to modify O.loc["192693-icsd", "192401-icsd"] so that it is of type np.float64?

CodePudding user response：

It's not that df.loc["192693-icsd", "192401-icsd"] contain a Series, your index just isn't unique. This is especially obvious looking at these outputs:

>>> df.loc["192693-icsd"]
             202324-icsd  644068-icsd  27121-icsd  93847-icsd  154319-icsd  28918-icsd  28917-icsd  ...  108768-icsd  194195-icsd  174188-icsd  159632-icsd  89111-icsd  23308-icsd  253341-icsd
192693-icsd          NaN          NaN         NaN         NaN     0.146843         NaN         NaN  ...          NaN     0.271191          NaN          NaN         NaN         NaN     0.253996
192693-icsd          NaN          NaN         NaN         NaN     0.146843         NaN         NaN  ...          NaN     0.271191          NaN          NaN         NaN         NaN     0.253996

[2 rows x 1402 columns]

# And the fact that this returns the same:
>>> df.at["192693-icsd", "192401-icsd"]
192693-icsd    0.129562
192693-icsd    0.129562
Name: 192401-icsd, dtype: float64

You can fix this with a groupby, but you have to decide what to do with the non-unique groups. It looks like they're the same, so we'll combine them with max:

df = df.groupby(level=0).max()

Now it'll work as expected:

>>> df.loc["192693-icsd", "192401-icsd"])
0.129562120551387

Your non-unique values are:

>>> df.index[df.index.duplicated()]
Index(['193303-icsd', '192693-icsd', '416602-icsd'], dtype='object')

CodePudding user response：

IIUC, you can try DataFrame.applymap to check each cell type and get the first row if it is Series

df = df.applymap(lambda x: x.iloc[0] if type(x) == pd.Series else x)

CodePudding user response：

It works as expected for O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]

Check this colab link: https://colab.research.google.com/drive/1XFXuj4OBu8GXQx6DTqv04XellmFcFWbC?usp=sharing