In the code below, I am trying to find the longest string in a DataFrame column.
Depending on the length of the column, the function below (maxstr) returns a single value for short columns (as expected) but a single-element Series for long columns (which I did not expect).
Any pointers would be appreciated.
I used the methods discussed in "Find length of longest string in Pandas dataframe column".
import numpy as np
import pandas as pd
As the data is large, I resort to displaying the information on the dataframe and series as I go along.
Read dataframe from clipboard
df = pd.read_clipboard(sep='\t', index_col=[0, 1, 2, 3, 4], na_values='')
print(f'{type(df)=}')
print(f'{df.shape=}')
print(f'{df.dtypes=}')
print(f'{df.columns=}')
type(df)=<class 'pandas.core.frame.DataFrame'>
df.shape=(581, 6)
df.dtypes=CID int64
TITLE object
FIRSTNAME object
FUNCTION object
PHONE object
EMAIL object
dtype: object
df.columns=Index(['CID', 'TITLE', 'FIRSTNAME', 'FUNCTION', 'PHONE', 'EMAIL'], dtype='object')
Function to return the maximum length string equivalent in a column/series
def maxstr(ser: pd.Series):
    print(f'{type(ser)=}')
    print(f'\n{type(ser.astype(str).str.len().idxmax())=}')
    print(f'{type(ser[ser.astype(str).str.len().idxmax()])=}')
    # should return a single value and not a series
    return ser[ser.astype(str).str.len().idxmax()]
Working with a short column (n=50), I get an int (as expected)
short = df.head(50)
short_return = maxstr(short['CID'])
type(ser)=<class 'pandas.core.series.Series'>
type(ser.astype(str).str.len().idxmax())=<class 'tuple'>
type(ser[ser.astype(str).str.len().idxmax()])=<class 'numpy.int64'>
Working with a longer column from the same dataframe (same data, n=100), I get a Series (not expected)
long = df.head(100)
long_return = maxstr(long['CID'])
type(ser)=<class 'pandas.core.series.Series'>
type(ser.astype(str).str.len().idxmax())=<class 'tuple'>
type(ser[ser.astype(str).str.len().idxmax()])=<class 'pandas.core.series.Series'>
In both cases, we find the same int value (but one in a series, and the other as a single value)
short_return == long_return.iloc[0]
True
The int value is unique, so it occurs once in the dataframe column
value = short_return
print(f'The value: {value=}')
print(f'{sum(short["CID"] == value)=}')
print(f'{sum(long["CID"] == value)=}')
The value: value=1937
sum(short["CID"] == value)=1
sum(long["CID"] == value)=1
Answer:
In my opinion, the problem is duplicated index values. idxmax returns a tuple (a MultiIndex label), and if that label is duplicated in the index, selecting by it returns not a scalar but all rows that share the label.
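The behaviour described above can be reproduced on a tiny Series. This is a minimal sketch with made-up data (the index labels and values are assumptions for illustration, not the asker's data): the same label-based lookup yields a Series when the label is duplicated and a scalar once the duplicates are dropped.

```python
import pandas as pd

# Hypothetical data: the MultiIndex label ("a", 1) appears twice.
idx = pd.MultiIndex.from_tuples([("a", 1), ("a", 1), ("b", 2)], names=["k1", "k2"])
ser = pd.Series([1937, 5, 7], index=idx)

# Label of the row whose string representation is longest (a tuple here).
label = ser.astype(str).str.len().idxmax()

# The label is duplicated, so label-based selection returns every matching row.
dup_result = ser[label]

# Keep only the first occurrence of each index label, then select again.
unique_ser = ser[~ser.index.duplicated()]
unique_result = unique_ser[label]

print(type(label))          # tuple
print(type(dup_result))     # Series
print(type(unique_result))  # scalar
```

In the question the value itself occurs only once, so it may be worth checking `df.index.is_unique` on both `df.head(50)` and `df.head(100)` to confirm that duplicated index labels first appear somewhere in the longer slice.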
A simple way to avoid this is to use the default index. Here, change:
df = pd.read_clipboard(sep='\t', index_col=[0, 1, 2, 3, 4], na_values='')
to:
df = pd.read_clipboard(sep='\t', na_values='')
so that the DataFrame has a default RangeIndex instead of a MultiIndex.
Check that the index is a RangeIndex:
print(df.index)
If you need the MultiIndex, the solution is to remove the rows with duplicated index values:
df = pd.read_clipboard(sep='\t', index_col=[0, 1, 2, 3, 4], na_values='')
df = df[~df.index.duplicated()]
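If dropping rows is not acceptable, one possible alternative (not part of the answer above) is to select by position rather than by label, so duplicated labels no longer matter. A minimal sketch, assuming a hypothetical helper name maxstr_positional and made-up data:

```python
import pandas as pd


def maxstr_positional(ser: pd.Series):
    """Return the element whose string form is longest, selected by position."""
    # argmax gives an integer position, not an index label.
    pos = ser.astype(str).str.len().argmax()
    # Positional selection returns one element even if index labels repeat.
    return ser.iloc[pos]


# Hypothetical data with a duplicated MultiIndex label.
idx = pd.MultiIndex.from_tuples([("a", 1), ("a", 1), ("b", 2)])
ser = pd.Series([1937, 5, 7], index=idx)

result = maxstr_positional(ser)
print(type(result), result)
```

As with idxmax, ties are resolved in favour of the first occurrence.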