I am trying to load a .csv file that contains 2 columns. The first column has floats and the second column has strings that correspond to each number in the 1st column.
I tried to load them in with file = np.genfromtxt('tester.csv', delimiter=',', skip_header=1),
but only the floats loaded; the strings all appeared as nan
in the array. What is the best way to load a .csv file into a 2d array with a column of floats and a column of strings?
The first few lines of the .csv file will look something like this
m/z, Lipid ID
885.5, PI 18:0_20:4
857.5, PI 16:0_20:4
834.5, PS 18:0_22:6
810.5, PS 18:0_20:4
790.5, PE 18:0_22:6
CodePudding user response:
Use pandas
to load your csv file, then convert it to a numpy
array:
import numpy as np
import pandas as pd
df = pd.read_csv('tester.csv', skipinitialspace=True)
df_to_array = df.to_numpy()
Your csv will be stored in df_to_array as a numpy array (with dtype=object, since one column is float and the other is string). skipinitialspace=True strips the space that follows each comma in your file.
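If you need the float column to stay numeric, note that a mixed float/string frame converts to a dtype=object array; individual columns keep their own dtype. A small sketch, using an in-memory sample in place of tester.csv:

```python
import io

import numpy as np
import pandas as pd

# In-memory stand-in for the first lines of tester.csv
csv_text = """m/z, Lipid ID
885.5, PI 18:0_20:4
857.5, PI 16:0_20:4
"""

df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

arr = df.to_numpy()        # mixed columns -> dtype=object
mz = df['m/z'].to_numpy()  # a single column keeps its float64 dtype
```

Accessing columns individually (df['m/z']) is usually preferable when you want to do arithmetic on the floats.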
CodePudding user response:
As you use numpy
, you can also install pandas
to load your csv file:
# Python env: pip install pandas
# Anaconda env: conda install pandas
import pandas as pd
df = pd.read_csv('tester.csv', sep=',', skipinitialspace=True)
Since the file is comma-delimited, the separator is a plain comma; skipinitialspace=True drops the space that follows each comma.
CodePudding user response:
In order to avoid the nans, you need to tell genfromtxt
the dtypes of the columns, because, by default, it tries to make everything a float.
dtypes = ['float', 'object']
csv = np.array(np.genfromtxt('tester.csv', delimiter=',', skip_header=1, dtype=dtypes).tolist(), dtype=object)
Passing dtype=object to the outer np.array keeps the floats as floats instead of letting numpy promote everything to a common string dtype.
Output:
>>> csv
array([[885.5, b'PI 18:0_20:4'],
[857.5, b'PI 16:0_20:4'],
[834.5, b'PS 18:0_22:6'],
[810.5, b'PS 18:0_20:4'],
[790.5, b'PE 18:0_22:6']], dtype=object)
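If you would rather have str fields than the b'...' bytes shown above, you can pass encoding=None and a unicode string dtype instead of object. A sketch using an in-memory sample in place of tester.csv (the 'U32' width is an assumption; pick one long enough for your labels):

```python
import io

import numpy as np

# In-memory stand-in for the first lines of tester.csv
csv_text = """m/z, Lipid ID
885.5, PI 18:0_20:4
857.5, PI 16:0_20:4
"""

# encoding=None plus a 'U' dtype yields str fields instead of bytes;
# autostrip removes the space after each comma
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1,
                     dtype=['float', 'U32'], autostrip=True, encoding=None)

# structured array -> list of tuples -> 2d object array
arr = np.array(data.tolist(), dtype=object)
```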
CodePudding user response:
In [228]: txt="""m/z, Lipid ID
...: 885.5, PI 18:0_20:4
...: 857.5, PI 16:0_20:4
...: 834.5, PS 18:0_22:6
...: 810.5, PS 18:0_20:4
...: 790.5, PE 18:0_22:6
...: """
genfromtxt
has a lot of possible parameters. It's not as fast as the pandas
equivalent, but still quite flexible.
In [229]: data = np.genfromtxt(txt.splitlines(),delimiter=',',dtype=None, encoding=None,
names=True, autostrip=True)
In [230]: data
Out[230]:
array([(885.5, 'PI 18:0_20:4'), (857.5, 'PI 16:0_20:4'),
(834.5, 'PS 18:0_22:6'), (810.5, 'PS 18:0_20:4'),
(790.5, 'PE 18:0_22:6')],
dtype=[('mz', '<f8'), ('Lipid_ID', '<U12')])
This is a structured array, with 2 fields. Because of the names
parameter, field names are taken from the file header line. With dtype=None
, it deduces a dtype for each column, in this case float and string. Fields are accessed by name:
In [231]: data['Lipid_ID']
Out[231]:
array(['PI 18:0_20:4', 'PI 16:0_20:4', 'PS 18:0_22:6', 'PS 18:0_20:4',
'PE 18:0_22:6'], dtype='<U12')
In [232]: data['mz']
Out[232]: array([885.5, 857.5, 834.5, 810.5, 790.5])
To make a 2d array we have to cast it to object dtype, allowing a mix of numbers and strings.
In [233]: np.array(data.tolist(), object)
Out[233]:
array([[885.5, 'PI 18:0_20:4'],
[857.5, 'PI 16:0_20:4'],
[834.5, 'PS 18:0_22:6'],
[810.5, 'PS 18:0_20:4'],
[790.5, 'PE 18:0_22:6']], dtype=object)
The structured array can be loaded into a dataframe, with a result similar to what a pandas read would produce:
In [235]: pd.DataFrame(data)
Out[235]:
mz Lipid_ID
0 885.5 PI 18:0_20:4
1 857.5 PI 16:0_20:4
2 834.5 PS 18:0_22:6
3 810.5 PS 18:0_20:4
4 790.5 PE 18:0_22:6
Dataframe to_records
produces a structured array, much like what we started with.
In [238]: _235.to_records(index=False)
Out[238]:
rec.array([(885.5, 'PI 18:0_20:4'), (857.5, 'PI 16:0_20:4'),
(834.5, 'PS 18:0_22:6'), (810.5, 'PS 18:0_20:4'),
(790.5, 'PE 18:0_22:6')],
dtype=[('mz', '<f8'), ('Lipid_ID', 'O')])
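As a quick sanity check (again with an in-memory sample standing in for tester.csv), the round trip read_csv -> to_records reproduces the column names and values from the file:

```python
import io

import numpy as np
import pandas as pd

csv_text = """m/z, Lipid ID
885.5, PI 18:0_20:4
857.5, PI 16:0_20:4
"""

df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
rec = df.to_records(index=False)  # structured (record) array

# Field names come straight from the csv header line
print(rec.dtype.names)
```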