Recently I'm struggling to read an csv file with pandas pd.read_csv. The problem is, that in the csv file a comma is used both as decimal point and as separator for columns. The csv looks as follows:
wavelength,intensity
390,0,382
390,1,390
390,2,400
390,3,408
390,4,418
390,5,427
390,6,437
390,7,447
390,8,457
390,9,468
Pandas accordingly always splits the data into three separate columns. However the first comma is only the decimal point. I want to plot it with the wavelength (x-axis) with 390.0, 390.1, 390.2 nm and so on.
I must somehow tell pandas, that the first comma in line is the decimal point, and the second one is the separator. How do I do this?
Best
CodePudding user response:
I'm not sure that this is possible. It almost is, as you can see by the following example:
>>> pd.read_csv('test.csv', engine='python', sep=r',(?!\d $)')
wavelength intensity
0 390 0,382
1 390 1,390
2 390 2,400
3 390 3,408
4 390 4,418
5 390 5,427
6 390 6,437
7 390 7,447
8 390 8,457
9 390 9,468
...but the wrong comma is being split. I'll keep trying to see if it's possible ;)
Meanwhile, a simple solution would be to take advantage of the fact that that pandas puts part of the first column in the index:
df = (pd.read_csv('test.csv')
.reset_index()
.assign(wavelength=lambda x: x['index'].astype(str) '.' x['wavelength'].astype(str))
.drop('index', axis=1)
.astype({'wavelength': float}))
Output:
>>> df
wavelength intensity
0 390.0 382
1 390.1 390
2 390.2 400
3 390.3 408
4 390.4 418
5 390.5 427
6 390.6 437
7 390.7 447
8 390.8 457
9 390.9 468
EDIT: It is possible!
The following regular expression with a little dropna column-wise gets it done:
df = pd.read_csv('test.csv', engine='python', sep=r',(!?\w )$').dropna(axis=1, how='all')
Output:
>>> df
wavelength intensity
0 390,0 382
1 390,1 390
2 390,2 400
3 390,3 408
4 390,4 418
5 390,5 427
6 390,6 437
7 390,7 447
8 390,8 457
9 390,9 468