Classifying pandas columns according to range limits

I have a dataframe with several numeric columns whose values range either from 1 to 5 or from 1 to 10.

I want to create two lists of these column names:

names_1to5 = list of all columns in df with numbers ranging from 1 to 5

names_1to10 = list of all columns in df with numbers ranging from 1 to 10

Example:

IP  track  batch  size  type
1    2      3     5      A
9    1      2     8      B
10   5      5     10     C

From the dataframe above:

  names_1to5 = ['track', 'batch']
  names_1to10 = ['IP', 'size']

I want to write a function that takes a dataframe and performs the above classification only on columns with numbers within those ranges.

I know that if a column's max() is 5 then it belongs in 1to5, and the same idea applies when max() is 10.

What I already did:

def test(df):
    list_1to5 = []
    list_1to10 = []
    
    for col in df:
        if df[col].max() == 5:
            list_1to5.append(col)
        else:
            list_1to10.append(col)
    return list_1to5, list_1to10

I tried the above, but it returns the following error message:

'>=' not supported between instances of 'float' and 'str'

The dtype of the columns is 'object', which may be the reason. If so, how can I fix the function without casting these columns to float? There are several, sometimes hundreds, of these columns, and if I run:

df['column'].max() I get 10 or 5

What's the best way to create this function?

CodePudding user response:

Use:

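import pandas as pd

string = """alpha IP  track  batch  size
A   1    2      3     5
B   9    1      2     8
C   10   5      5     10"""

# Parse the sample text into a header row and data rows
temp = [x.split() for x in string.split('\n')]
cols = temp[0]
data = temp[1:]

def test(df):
    list_1to5 = []
    list_1to10 = []

    for col in df.columns:
        # Only classify numeric columns; text columns such as 'alpha' are skipped
        if pd.api.types.is_numeric_dtype(df[col]):
            if df[col].max() == 5:
                list_1to5.append(col)
            else:
                list_1to10.append(col)
    return list_1to5, list_1to10

df = pd.DataFrame(data, columns=cols)
# Cast each column to float where possible, since passing dtype=float to the
# constructor can fail when a column holds text
for col in df.columns:
    try:
        df[col] = df[col].astype(float)
    except ValueError:
        pass  # non-numeric column, leave it unchanged

print(test(df))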

Output:

(['track', 'batch'], ['IP', 'size'])
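
If you would rather not cast the dataframe itself, here is a minimal sketch of an alternative. It assumes the numeric columns are stored as object dtype (numbers as strings, possibly with missing values) and that a column with no numeric entries at all, such as 'type', should be ignored. The name classify_columns and the sample data are made up for illustration. pd.to_numeric with errors="coerce" converts each column on the fly, so df keeps its original dtypes.

```python
import pandas as pd


def classify_columns(df):
    """Split column names by their maximum value (5 or 10)."""
    names_1to5, names_1to10 = [], []
    for col in df.columns:
        # Convert on the fly; entries that are not numbers become NaN.
        values = pd.to_numeric(df[col], errors="coerce")
        # Skip columns with no numeric entries at all (e.g. 'type').
        if values.isna().all():
            continue
        col_max = values.max()  # NaN entries are ignored by max()
        if col_max == 5:
            names_1to5.append(col)
        elif col_max == 10:
            names_1to10.append(col)
    return names_1to5, names_1to10


# Hypothetical sample mirroring the question, stored as object dtype.
df = pd.DataFrame(
    {
        "IP": ["1", "9", "10"],
        "track": ["2", "1", "5"],
        "batch": ["3", "2", "5"],
        "size": ["5", "8", "10"],
        "type": ["A", "B", "C"],
    },
    dtype=object,
)

result = classify_columns(df)
print(result)  # (['track', 'batch'], ['IP', 'size'])
```

Unlike the else branch above, this version leaves out columns whose maximum is neither 5 nor 10 instead of putting them in the 1to10 list. Whether that is what you want depends on your data.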