Home > OS >  Pandas read_csv: decimal and delimiter is the same character
Pandas read_csv: decimal and delimiter is the same character

Time:03-25

Recently I'm struggling to read an csv file with pandas pd.read_csv. The problem is, that in the csv file a comma is used both as decimal point and as separator for columns. The csv looks as follows:

wavelength,intensity
390,0,382
390,1,390
390,2,400
390,3,408
390,4,418
390,5,427
390,6,437
390,7,447
390,8,457
390,9,468

Pandas accordingly always splits the data into three separate columns. However the first comma is only the decimal point. I want to plot it with the wavelength (x-axis) with 390.0, 390.1, 390.2 nm and so on.

I must somehow tell pandas, that the first comma in line is the decimal point, and the second one is the separator. How do I do this?

Best

CodePudding user response:

I'm not sure that this is possible. It almost is, as you can see by the following example:

>>> pd.read_csv('test.csv', engine='python', sep=r',(?!\d $)')
   wavelength intensity
0         390     0,382
1         390     1,390
2         390     2,400
3         390     3,408
4         390     4,418
5         390     5,427
6         390     6,437
7         390     7,447
8         390     8,457
9         390     9,468

...but the wrong comma is being split. I'll keep trying to see if it's possible ;)

Meanwhile, a simple solution would be to take advantage of the fact that that pandas puts part of the first column in the index:

df = (pd.read_csv('test.csv')
    .reset_index()
    .assign(wavelength=lambda x: x['index'].astype(str)   '.'   x['wavelength'].astype(str))
    .drop('index', axis=1)
    .astype({'wavelength': float}))

Output:

>>> df
   wavelength  intensity
0       390.0        382
1       390.1        390
2       390.2        400
3       390.3        408
4       390.4        418
5       390.5        427
6       390.6        437
7       390.7        447
8       390.8        457
9       390.9        468

EDIT: It is possible!

The following regular expression with a little dropna column-wise gets it done:

df = pd.read_csv('test.csv', engine='python', sep=r',(!?\w )$').dropna(axis=1, how='all')

Output:

>>> df
  wavelength  intensity
0      390,0        382
1      390,1        390
2      390,2        400
3      390,3        408
4      390,4        418
5      390,5        427
6      390,6        437
7      390,7        447
8      390,8        457
9      390,9        468
  • Related