Involutive (up to precision) operations "dataframe to csv" and "csv to dataframe"

Time:11-29

I have a numerically intensive, vectorized Python function def f(x, y) of two variables that I evaluate (with frompyfunc and broadcasting) on an np.array X = [x0, ..., xN-1] of x's and an np.array Y = [y0, ..., yM-1] of y's, with N and M between 5 and 10 thousand. The result is a 2D np.array Z of shape (N, M) containing values z[i,j] such that z[i,j] = f(X[i], Y[j]) for all i and j. The function f is already optimized, and this evaluation takes roughly 45 minutes. Since I then write, debug and profile code that uses Z, I want to "save" the "matrix" Z in a csv file, in this kind of format:

    0.25        0.5         0.75        1           1.25
0.1 0.876155737 0.888282356 0.904731158 0.910351368 0.906284762
0.2 0.810528369 0.797068044 0.806520168 0.805697704 0.80659234
0.3 0.696280633 0.704307378 0.703540949 0.705198518 0.708672067
0.4 0.601264163 0.605194    0.607882    0.611616655 0.612408848
0.5 0.502995372 0.509209974 0.513651558 0.516065068 0.51994982

(this is a tiny upper-left part of my "matrix": the first column is the beginning of X, the first row is the beginning of Y, and the rest is the matrix itself, meaning that, for instance, f(0.4, 0.75) = 0.607882).

I naturally used a pd.DataFrame as follows:

df = pd.DataFrame(data=Z, columns=Y, index=X)
df.to_csv(some_full_path_filename)

and indeed the csv file looks like I want it to look, that is, like the small bit of the matrix above.

Now if I

df2 = pd.read_csv(some_full_path_filename)
df2.to_csv(some_full_path_filename2, index=False)

the second csv file looks like :

Unnamed: 0  0.25        0.5         0.75        1           1.25
0.1         0.876155737 0.888282356 0.904731158 0.910351368 0.906284762
0.2         0.810528369 0.797068044 0.806520168 0.805697704 0.80659234
0.3         0.696280633 0.704307378 0.703540949 0.705198518 0.708672067
0.4         0.601264163 0.605194    0.607882    0.611616655 0.612408848
0.5         0.502995372 0.509209974 0.513651558 0.516065068 0.51994982

which is the closest I managed to get to the first csv file while experimenting with pandas. And of course, the two dataframes df and df2 are not "equal".

Hence the question's title: an operation is involutive when applying it twice gives back the starting value, and by that standard my "dataframe to csv file" and "csv file to dataframe" operations are not involutive.

To be precise, there are floating point rounding differences between the dataframes and the csv files: one could contain 0.0072618782055291 where the other has 0.0072618782055290999999999 at the same place. This is not a problem for me.

What I would like is for my "dataframe to csv file" and "csv file to dataframe" operations to produce dataframes and csv files that are structurally equal.

"Structurally" meaning:

  • for the csv files: the same values (up to rounding) and strings (if any) in every "cell"
  • for the dataframes: of course they won't be equal per se, since they don't "point" to the same allocated memory, but I would like them to be equal in the sense that all numerical/text values in them represent the same numbers/strings (up to rounding for numbers)
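As an aside on the rounding caveat: a float64 survives a text round trip exactly if it is written with 17 significant digits and parsed with a correctly rounding parser. A hedged sketch (not from the question; the file name and toy data are made up) using to_csv's float_format and read_csv's float_precision="round_trip":

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (Z itself takes 45 minutes to compute)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((3, 3)),
                  index=[0.1, 0.2, 0.3],
                  columns=[0.25, 0.5, 0.75])

# 17 significant digits are enough for a float64 to round-trip through text
df.to_csv("full_precision.csv", float_format="%.17g")
df2 = pd.read_csv("full_precision.csv", index_col=0,
                  float_precision="round_trip")

assert (df2.to_numpy() == df.to_numpy()).all()  # bit-identical values
```

This removes the rounding differences in the values, though the column labels still come back as strings (see the answer below for that part).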

CodePudding user response:

There should be a difference, because in a csv all data are saved as strings. If you use index_col=0, a float index is correctly recreated, but the column names remain strings, and data in the columns may also be parsed differently (e.g. if strings and numbers are mixed):

f = 'file.csv'
df.to_csv(f)

df = pd.read_csv(f, index_col=0)
print (df)
         0.25       0.5      0.75         1      1.25
0.1  0.876156  0.888282  0.904731  0.910351  0.906285
0.2  0.810528  0.797068  0.806520  0.805698  0.806592
0.3  0.696281  0.704307  0.703541  0.705199  0.708672
0.4  0.601264  0.605194  0.607882  0.611617  0.612409
0.5  0.502995  0.509210  0.513652  0.516065  0.519950


print (df.columns)
Index(['0.25', '0.5', '0.75', '1.0', '1.25'], dtype='object')
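Building on that, one way to get a structurally equal dataframe back from the csv alone (no pickle) is to cast the string column labels back to float after reading. A minimal sketch, with a toy matrix standing in for the question's Z:

```python
import numpy as np
import pandas as pd

X = np.array([0.1, 0.2, 0.3])
Y = np.array([0.25, 0.5, 0.75])
Z = np.add.outer(X, Y)  # toy stand-in for f(X[i], Y[j])

df = pd.DataFrame(Z, index=X, columns=Y)
df.to_csv("file.csv")

df2 = pd.read_csv("file.csv", index_col=0)
df2.columns = df2.columns.astype(float)  # '0.25' -> 0.25, etc.

# same labels and same values, up to floating point rounding
assert np.allclose(df2.columns, df.columns)
assert np.allclose(df2.index, df.index)
assert np.allclose(df2.to_numpy(), df.to_numpy())
```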

Another idea is to use pickle: DataFrame.to_pickle and read_pickle save a DataFrame with its columns and index types intact:

print (df.columns)
Float64Index([0.25, 0.5, 0.75, 1.0, 1.25], dtype='float64')

f = 'file'
df.to_pickle(f)

df1 = pd.read_pickle(f)
print (df1)
         0.25      0.50      0.75      1.00      1.25
0.1  0.876156  0.888282  0.904731  0.910351  0.906285
0.2  0.810528  0.797068  0.806520  0.805698  0.806592
0.3  0.696281  0.704307  0.703541  0.705199  0.708672
0.4  0.601264  0.605194  0.607882  0.611617  0.612409
0.5  0.502995  0.509210  0.513652  0.516065  0.519950

print (df1.columns)
Float64Index([0.25, 0.5, 0.75, 1.0, 1.25], dtype='float64')

print (df1.equals(df))
True
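If the goal is simply to cache Z between debugging sessions, another option (not mentioned in the answer, but in the same spirit as pickle) is to save the raw arrays with NumPy's savez/load. This stores the exact binary floats, so the round trip is bit-for-bit involutive; a sketch with toy data:

```python
import numpy as np

X = np.linspace(0.1, 0.5, 5)
Y = np.linspace(0.25, 1.25, 5)
Z = np.add.outer(X, Y)  # toy stand-in for the expensive evaluation

# savez stores the exact binary representation of each array
np.savez("grid.npz", X=X, Y=Y, Z=Z)

with np.load("grid.npz") as data:
    assert (data["Z"] == Z).all()  # no rounding at all
    assert (data["X"] == X).all()
```

The csv can still be produced separately whenever a human-readable view is needed.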