Data type error in normalization in Python

I wanted to normalize the values of my dataframe df using the following code:

# copy the data
df_min_max_scaled = df.copy()

# apply normalization techniques
for column in df_min_max_scaled.columns:
    df_min_max_scaled[column] = (df_min_max_scaled[column] - df_min_max_scaled[column].min()) / (df_min_max_scaled[column].max() - df_min_max_scaled[column].min()) 

# view normalized data
print(df_min_max_scaled)

However, I got the error:

TypeError: unsupported operand type(s) for -: 'str' and 'str'

So I tried to convert it into float:

# copy the data
df_min_max_scaled = df.copy()

# apply normalization techniques
for column in df_min_max_scaled.columns:
    df_min_max_scaled[column] = (df_min_max_scaled[column]) - float(df_min_max_scaled[column].min()) / float((df_min_max_scaled[column].max()) - float(df_min_max_scaled[column].min()) 

# view normalized data
print(df_min_max_scaled)

Now I get the error:

  Cell In [16], line 9
    print(df_min_max_scaled)
    ^
SyntaxError: invalid syntax

I don't understand why, because that line does not look like a syntax error at all.

CodePudding user response：

You're getting a syntax error because the parentheses are unbalanced on the line where you apply the normalization: the float(( before .max() opens one more parenthesis than the line ever closes, so Python only reports the problem when it reaches the next line.

But this is unnecessarily complicated. Assuming you want all of the columns to be float type, you can use the astype method to convert every column from string to float before you apply the normalization.

For example:

import pandas as pd

## construct df with numeric strings
df = pd.DataFrame({'a':list('1234'),'b':list('4567')})

df_min_max_scaled = df.copy()
df_min_max_scaled = df_min_max_scaled.astype('float')

And then your original attempt should run:

for column in df_min_max_scaled.columns:
    df_min_max_scaled[column] = (df_min_max_scaled[column] - df_min_max_scaled[column].min()) / (df_min_max_scaled[column].max() - df_min_max_scaled[column].min())

Output:

>>> df_min_max_scaled
          a         b
0  0.000000  0.000000
1  0.333333  0.333333
2  0.666667  0.666667
3  1.000000  1.000000
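
To confirm that the dtype is the cause, it can help to inspect the column types before and after the conversion. This is only a minimal sketch that reuses the toy frame above; the names raw and converted are illustrative and not part of the original question.

```python
import pandas as pd

# Same toy frame as above: every value is a numeric string, not a number
raw = pd.DataFrame({'a': list('1234'), 'b': list('4567')})
print(raw.dtypes)        # string-like columns, so arithmetic on them fails

# Convert every column to float in one step
converted = raw.astype('float')
print(converted.dtypes)  # float64 for both columns
```

If the first print already shows numeric types for your own data, the TypeError comes from somewhere else.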

CodePudding user response：

First of all, before applying any normalization, the entire dataframe must be cast from 'str' to 'float', provided it contains only numeric values.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html?highlight=astype#pandas.DataFrame.astype

DataFrame.astype(dtype, copy=True, errors='raise') Cast a pandas object to a specified dtype dtype.

df_min_max_scaled = df.astype("float")
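
Note that astype("float") raises a ValueError as soon as one cell cannot be parsed as a number. If your data might contain such cells, one possible alternative (an assumption here, the question does not say this is the case) is pd.to_numeric with errors="coerce", which replaces unparseable values with NaN. A small sketch with a made-up frame called messy:

```python
import pandas as pd

# Hypothetical frame: column 'b' holds one value that is not a number
messy = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', 'n/a', '6']})

# Convert column by column; anything that cannot be parsed becomes NaN
cleaned = messy.apply(pd.to_numeric, errors='coerce')

print(cleaned.dtypes)
print(cleaned)
```

The min() and max() methods skip NaN by default, so the normalization formula still runs on the cleaned frame, but the NaN cells stay NaN in the result.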

Writing an explicit loop over a dataframe is not considered good practice because it is generally slower than the vectorized methods pandas provides. The same column-wise formula can be expressed with apply.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html?highlight=apply#pandas.DataFrame.apply

DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs) Apply a function along an axis of the DataFrame.

df_min_max_scaled = df_min_max_scaled.apply(lambda x: (x - x.min()) / (x.max() - x.min()))
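
The formula can also be applied to the whole frame at once, because min() and max() return one value per column and pandas aligns them with the columns during the arithmetic. A minimal sketch, assuming every column is already numeric and not constant (a constant column would give 0/0, which is NaN); the names col_min, col_max and scaled are illustrative:

```python
import pandas as pd

# Toy frame of numeric strings, converted to float first
df = pd.DataFrame({'a': list('1234'), 'b': list('4567')}).astype('float')

col_min = df.min()  # Series with one minimum per column
col_max = df.max()  # Series with one maximum per column

# Broadcasts column by column: no explicit loop and no lambda
scaled = (df - col_min) / (col_max - col_min)
print(scaled)
```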

You can also use the rich standardization and normalization tools of the scikit-learn library:

https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# fit_transform returns a NumPy array, so wrap it back into a dataframe
scaler = MinMaxScaler()
df_min_max_scaled = pd.DataFrame(scaler.fit_transform(df_min_max_scaled), columns=df_min_max_scaled.columns)
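
As a sanity check, the scaler output should match the manual formula, since MinMaxScaler scales each column to the range 0 to 1 by default. Because fit_transform returns a plain NumPy array, the row labels are lost unless index= is passed as well. The sketch below uses a made-up frame with a non-default index to show both points; all names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with custom row labels, converted from strings to float
df = pd.DataFrame(
    {'a': list('1234'), 'b': list('4567')},
    index=list('wxyz'),
).astype('float')

scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df),
    columns=df.columns,
    index=df.index,  # keep the original row labels
)

# Manual min-max formula for comparison
manual = (df - df.min()) / (df.max() - df.min())
same = np.allclose(scaled.to_numpy(), manual.to_numpy())
print(same)
```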