In pandas, it is possible to do the following:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99919 entries, 0 to 99918
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 reason 99919 non-null object
Most of these entries contain []
, and some contain an actual object, e.g. [{'a':'x', 'b':'y', 'c':'z'}]
I want to efficiently (without using df.apply
) set all entries that contain only the empty list to e.g. None
.
I thought that I might be able to do something like the following:
df[df['reasons'].str.len() == 0]['reasons'] = None
but this does not work since it is setting it on a copy of the dataframe (see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy)
Usually in the case where I set something on df[row][column]
, I can instead do:
df.loc[:, (row, column)]
to set it on the actual dataframe, but that does not work in this case since df['reasons'].str.len() == 0
returns a series which is unhashable and not a valid argument for loc
.
Is there any way to do this without using apply
?
CodePudding user response:
The syntax should be:
df.loc[df['reasons'].str.len() == 0, 'reasons'] = None
The correct use of loc
is loc[row, col]
, not loc[:, (row, col)]
df.loc[:, (X, Y)]
can be used if you have a MultiIndex (see below for an example), but this is not the case here.
df = pd.DataFrame(None, index=range(2),
columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]]))
df.loc[:, ('A', 1)]
CodePudding user response:
example:
df = pd.DataFrame([[2, []], [1, None], [1,3]], columns=['col1', 'col2'])
output(df
)
col1 col2
0 2 []
1 1 None
2 1 3
- when convert to bool dtype, empty list convert to False
where
make False to NaN
df.where(df.astype('bool'))
output:
col1 col2
0 2 NaN
1 1 NaN
2 1 3
if chk only col2
column
df['col2'].where(df['col2'].astype('bool'))