How would you go about finding longest string per row in a data frame?-CodePudding

I am writing a piece of code which allows me to open a CSV file and remove nan rows and also find strings that are too long in the data frame. I want the program to say what row the length of data exceeds the 30-character limit and give you an option to exit or skip.

I previously had it set up so it would go by columns instead, however im finding it difficult to locate the string when its set up like this.

for column in df:
        print(column,"->", df[column].astype(str).str.len().max())
        if df[column].astype(str).str.len().max() > 30 and column != ('Column 17'):
            print ("ERROR: Length of data exceeds 30 character limit")
            abill=int(input("1.Continue through file.\n2.Exit\n"))
            if abill==1:
                continue
            else:
                sys.exit()
        else:
            continue

This is my code at the moment.

CodePudding user response：

I would recommend not to use a loop, but rather to vectorize.

So, you want to identify the strings longer than a threshold, except for excluded columns?

Assuming this example:

        col1        col2              col3
0        abc           A  this_is_excluded
1  defghijkl       BCDEF          excluded
2       mnop  GHIJKLMNOP          excluded

If you want to mask the long strings:

exclude = ['col3']  # do not consider the columns in this list
threshold = 9       # consider strings longer or equal to threshold

mask = (df.drop(columns=exclude, errors='ignore')
          .apply(lambda s: s.str.len().ge(threshold))
          .reindex(columns=df.columns, fill_value=False)
       )

out = df.mask(mask, '') # mask with empty string

Output:

   col1   col2              col3
0   abc      A  this_is_excluded
1        BCDEF          excluded
2  mnop                 excluded

If you want to drop the rows with long strings:

exclude = ['col3']
threshold = 9

mask = (df.drop(columns=exclude, errors='ignore')
          .apply(lambda s: s.str.len().ge(threshold))
        )

out = df.loc[~mask.any(axis=1)]

Output:

  col1 col2              col3
0  abc    A  this_is_excluded

If you want to drop the columns with at least one too long string:

exclude = ['col3']
threshold = 9

mask = (df.drop(columns=exclude, errors='ignore')
          .agg(lambda s: s.str.len().ge(threshold).any())
       )

out = df.drop(columns=mask[mask].index)

Output:

               col3
0  this_is_excluded
1          excluded
2          excluded