Pandas treats two similar CSV files differently, reading numerical values as strings

I have an original .csv file that pandas reads as expected, and a pruned version of the same file with duplicate rows removed and everything else unchanged. However, pandas reads the numerical values in the second file as strings, so I cannot perform arithmetic on the dataframe.

df1 = pd.read_csv("file1.csv")
print(df1)
df1 = (df1 - df1.min())/(df1.max() - df1.min())

             attr1   attr2     attr3 ...     attr7        attr  attr9
0            0.384  0.0893   -30.439  ...   75.499       140417     0
...            ...     ...       ...  ...      ...          ...   ...
2109         0.745  0.5430    -8.137  ...  139.964       185267     1

[2110 rows x 11 columns]
Process finished with exit code 0

df2 = pd.read_csv("file2.csv")
print(df2)
df2 = (df2 - df2.min())/(df2.max() - df2.min())

            attr1   attr2     attr3 ...     attr7        attr8 attr9
0           0.866  0.7300    -8.201  ...  118.523       379266     2
..            ...     ...       ...  ...      ...          ...   ...
1853        0.377  0.0156   -28.435  ...  140.179       186331     0

[1853 rows x 11 columns]
TypeError: unsupported operand type(s) for -: 'str' and 'str'

CodePudding user response:

pd.read_csv accepts a dtype argument, which lets you specify the type of each column. For example:

pd.read_csv('file1.csv', dtype={'variable1': 'float'})

will read the column variable1 as floating-point numbers.
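As a minimal, self-contained sketch of the dtype argument, the snippet below reads a small in-memory CSV instead of a file on disk. The data and the column names variable1 and variable2 are made up for illustration and are not taken from the files in the question.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (illustrative data only).
csv_text = "variable1,variable2\n0.384,140417\n0.745,185267\n"

# Force variable1 to float; variable2 is left to pandas' type inference.
df = pd.read_csv(io.StringIO(csv_text), dtype={"variable1": "float"})

# Inspect the resulting column types.
print(df.dtypes)
```

Columns not listed in the dtype dictionary keep whatever type pandas infers for them.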

Alternatively, you can convert the column after reading the file:

df1['variable1'] = df1['variable1'].astype(float)
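If the conversion fails, it can help to find out which cells pandas could not parse as numbers. The sketch below is one way to do that with pd.to_numeric and errors="coerce". It assumes, purely for illustration, that a stray non-numeric cell (here a repeated header value) is what turns the column into strings; the question does not show the actual offending value, and the sample data is hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for file2.csv: one cell in attr1 is not a number.
csv_text = "attr1,attr2\n0.866,0.7300\nattr1,0.0156\n0.377,0.5430\n"
df2 = pd.read_csv(io.StringIO(csv_text))

# Columns read as strings show up as object (or a string dtype in newer pandas).
print(df2.dtypes)

# Convert every column, turning cells that cannot be parsed into NaN.
numeric = df2.apply(pd.to_numeric, errors="coerce")

# Cells that held a value in the file but did not parse as a number.
bad_mask = numeric.isna() & df2.notna()
print(df2[bad_mask.any(axis=1)])

# Drop the offending rows, then apply the min-max scaling from the question.
cleaned = numeric.dropna()
normalized = (cleaned - cleaned.min()) / (cleaned.max() - cleaned.min())
print(normalized)
```

Whether dropping the affected rows is appropriate depends on the data; printing them first makes it easier to decide whether to fix the file instead.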