I wanted to normalize the values of my dataframe df using the following code:
# copy the data
df_min_max_scaled = df.copy()
# apply normalization techniques
for column in df_min_max_scaled.columns:
    df_min_max_scaled[column] = (df_min_max_scaled[column] - df_min_max_scaled[column].min()) / (df_min_max_scaled[column].max() - df_min_max_scaled[column].min())
# view normalized data
print(df_min_max_scaled)
However, I got the error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
So I tried to convert it into float:
# copy the data
df_min_max_scaled = df.copy()
# apply normalization techniques
for column in df_min_max_scaled.columns:
    df_min_max_scaled[column] = (df_min_max_scaled[column]) - float(df_min_max_scaled[column].min()) / float((df_min_max_scaled[column].max()) - float(df_min_max_scaled[column].min())
# view normalized data
print(df_min_max_scaled)
Now I get the error:
Cell In [16], line 9
print(df_min_max_scaled)
^
SyntaxError: invalid syntax
I don't understand why, because that line doesn't look like a syntax error at all.
CodePudding user response:
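You're getting a syntax error because the normalization line has an extra opening parenthesis: float(( opens two parentheses but only one of them is closed. Python therefore treats the next line as a continuation of the same expression and reports the error at print(df_min_max_scaled).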
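As a minimal sketch of why the error is reported on a line that looks fine: the snippet below feeds a made-up two-line program (not the code from the question) to the built-in compile function. The first line leaves a parenthesis open, so the parser only notices the problem later. The exact message and the reported line number depend on the Python version.

```python
# Hypothetical two-line program: line 1 opens two parentheses but closes only one.
source = "x = float((1)\nprint(x)\n"

try:
    compile(source, "<example>", "exec")
    error = None
except SyntaxError as exc:
    error = exc

# Older interpreters typically blame line 2 with "invalid syntax";
# newer ones point back at line 1 and say the parenthesis was never closed.
print(error.lineno, error.msg)
```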
But this is unnecessarily complicated. Assuming you want all of the columns to be of float type, you can use the astype method to convert every column from string to float before you apply the normalization.
For example:
import pandas as pd
## construct df with numeric strings
df = pd.DataFrame({'a':list('1234'),'b':list('4567')})
df_min_max_scaled = df.copy()
df_min_max_scaled = df_min_max_scaled.astype('float')
And then your original attempt should run:
for column in df_min_max_scaled.columns:
    df_min_max_scaled[column] = (df_min_max_scaled[column] - df_min_max_scaled[column].min()) / (df_min_max_scaled[column].max() - df_min_max_scaled[column].min())
Output:
>>> df_min_max_scaled
          a         b
0  0.000000  0.000000
1  0.333333  0.333333
2  0.666667  0.666667
3  1.000000  1.000000
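If some columns might contain text that is not a number, astype('float') raises a ValueError. A minimal sketch, assuming you would rather turn unparseable entries into NaN than stop with an error, is to convert each column with pd.to_numeric and errors="coerce" before normalizing. The sample DataFrame raw and its contents are made up for illustration.

```python
import pandas as pd

# Made-up data: column "b" contains one entry that is not a number.
raw = pd.DataFrame({"a": ["1", "2", "3", "4"], "b": ["4", "5", "n/a", "7"]})

# Convert column by column; values that cannot be parsed become NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric)
```

Whether NaN is the right outcome depends on your data; dropping or fixing the offending rows first may be more appropriate.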
CodePudding user response:
- First of all, before applying any normalization, the entire dataframe must be cast from 'str' to 'float', provided that it contains only numerical values.
DataFrame.astype(dtype, copy=True, errors='raise'): Cast a pandas object to a specified dtype.
df_min_max_scaled = df.astype("float")
- Explicit Python loops over a dataframe are generally not considered good practice because they tend to be slower than the vectorized methods that pandas provides. The per-column computation can be passed to apply instead.
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs): Apply a function along an axis of the DataFrame.
df_min_max_scaled = df_min_max_scaled.apply(lambda x: (x - x.min()) / (x.max() - x.min()))
- You can also use the rich standardization and normalization tools of the scikit-learn library:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
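# fit_transform returns a NumPy array, so restore the column names and the index
df_min_max_scaled = pd.DataFrame(scaler.fit_transform(df_min_max_scaled), columns=df_min_max_scaled.columns, index=df_min_max_scaled.index)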
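The column-wise normalization in the bullets above can also be sketched as plain DataFrame arithmetic, with no loop and no apply, because df.min() and df.max() return one value per column and pandas aligns them by column label. The example reuses the sample data from the first answer. The guard for constant columns is an added assumption, not part of either answer: a column whose maximum equals its minimum would otherwise be divided by zero and come out as NaN.

```python
import pandas as pd

# Sample data from the first answer: numeric strings cast to float.
df = pd.DataFrame({"a": list("1234"), "b": list("4567")}).astype("float")

col_min = df.min()              # one minimum per column
col_range = df.max() - col_min  # one range per column

# Assumption: map constant columns to 0 instead of NaN by dividing by 1.
col_range = col_range.replace(0, 1)

df_min_max_scaled = (df - col_min) / col_range
print(df_min_max_scaled)
```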