I am writing a piece of code which allows me to open a CSV file and remove nan rows and also find strings that are too long in the data frame. I want the program to say what row the length of data exceeds the 30-character limit and give you an option to exit or skip.
I previously had it set up so it would go by columns instead, however im finding it difficult to locate the string when its set up like this.
for column in df:
print(column,"->", df[column].astype(str).str.len().max())
if df[column].astype(str).str.len().max() > 30 and column != ('Column 17'):
print ("ERROR: Length of data exceeds 30 character limit")
abill=int(input("1.Continue through file.\n2.Exit\n"))
if abill==1:
continue
else:
sys.exit()
else:
continue
This is my code at the moment.
CodePudding user response:
I would recommend not to use a loop, but rather to vectorize.
So, you want to identify the strings longer than a threshold, except for excluded columns?
Assuming this example:
col1 col2 col3
0 abc A this_is_excluded
1 defghijkl BCDEF excluded
2 mnop GHIJKLMNOP excluded
If you want to mask the long strings:
exclude = ['col3'] # do not consider the columns in this list
threshold = 9 # consider strings longer or equal to threshold
mask = (df.drop(columns=exclude, errors='ignore')
.apply(lambda s: s.str.len().ge(threshold))
.reindex(columns=df.columns, fill_value=False)
)
out = df.mask(mask, '') # mask with empty string
Output:
col1 col2 col3
0 abc A this_is_excluded
1 BCDEF excluded
2 mnop excluded
If you want to drop the rows with long strings:
exclude = ['col3']
threshold = 9
mask = (df.drop(columns=exclude, errors='ignore')
.apply(lambda s: s.str.len().ge(threshold))
)
out = df.loc[~mask.any(axis=1)]
Output:
col1 col2 col3
0 abc A this_is_excluded
If you want to drop the columns with at least one too long string:
exclude = ['col3']
threshold = 9
mask = (df.drop(columns=exclude, errors='ignore')
.agg(lambda s: s.str.len().ge(threshold).any())
)
out = df.drop(columns=mask[mask].index)
Output:
col3
0 this_is_excluded
1 excluded
2 excluded