I am trying to import data from a text file with a varying number of columns and store it in an array of arrays. I know that the first column will always be a string and the next three columns will be integers, but so far I have only managed to read the file as an array of tuples.
I have tried using dtype=(object,int,int,int):
from io import StringIO
import numpy as np
new_string = StringIO("01/23/2020, 32, 0, 2 \n01/31/2020' ,436 ,0 ,10")
new_result = np.genfromtxt(new_string, dtype=(object,int,int,int), encoding="unicode"
, delimiter=",")
print("File data:",new_result )
output:
File data: [('01/23/2020', 32, 0, 2) ("01/31/2020' ", 436, 0, 10)]
I want the output to look like this:
[['01/23/2020' 32 0 2]
['01/31/2020' 436 0 10]]
so that
new_result == np.array( [['01/23/2020',32,0,2],
['01/31/2020', 436, 0, 10]],dtype=object)
will be True for every element.
CodePudding user response:
This should work for your problem:
import numpy as np
example_string = "01/23/2020, 32, 0, 2 \n01/31/2020' ,436 ,0 ,10"
example_string_filtered = example_string.replace(' ','').replace("'",'')
newline_split = example_string_filtered.split('\n')
result = []
for line in newline_split:
    line_split = line.split(',')
    result.append([line_split[0], int(line_split[1]), int(line_split[2]), int(line_split[3])])
result = np.array(result, dtype='O')
print(result)
result: [['01/23/2020', 32, 0, 2], ['01/31/2020', 436, 0, 10]]
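Since the question mentions a varying number of columns, here is a minimal sketch of the same idea that does not hard-code four fields. The helper name parse_rows is hypothetical, and the sketch assumes the first field is always the string and every remaining field parses as an integer:

```python
import csv
from io import StringIO

def parse_rows(text):
    """Parse comma-separated text into rows of [str, int, int, ...].

    Hypothetical helper: assumes the first field is a label and all
    remaining fields are integers.
    """
    rows = []
    for fields in csv.reader(StringIO(text)):
        if not fields:  # skip blank lines
            continue
        # Strip surrounding spaces and the stray quote seen in the sample data.
        label = fields[0].strip(" '")
        # int() tolerates surrounding whitespace, so no extra strip is needed.
        numbers = [int(value) for value in fields[1:]]
        rows.append([label] + numbers)
    return rows

rows = parse_rows("01/23/2020, 32, 0, 2 \n01/31/2020' ,436 ,0 ,10")
print(rows)
```

If every row has the same length, np.array(rows, dtype=object) should give the 2d object array asked for. With rows of different lengths it gives a 1d array of lists instead.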
CodePudding user response:
Specifying a dtype like that produces a structured array:
https://numpy.org/doc/stable/user/basics.rec.html
In [40]: new_string = StringIO("01/23/2020, 32, 0, 2 \n01/31/2020' ,436 ,0 ,10")
...: new_result = np.genfromtxt(new_string, dtype=(object,int,int,int), encoding="unicode"
...: , delimiter=",")
This is a 1d array with a compound dtype. The print display just shows the elements, or records, as tuples, but the repr display shows the dtype as well:
In [41]: new_result
Out[41]:
array([(b'01/23/2020', 32, 0, 2), (b"01/31/2020' ", 436, 0, 10)],
dtype=[('f0', 'O'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
In [42]: new_result.dtype
Out[42]: dtype([('f0', 'O'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
Fields are accessed by name:
In [43]: new_result['f0']
Out[43]: array([b'01/23/2020', b"01/31/2020' "], dtype=object)
In [44]: new_result['f1']
Out[44]: array([ 32, 436])
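For reference, the same kind of array can be built by hand, which makes the 1d shape and the field access easier to see. This is only a sketch: the field names f0 to f3 copy the defaults genfromtxt assigned above, and the 'U10' string width is an assumption chosen to fit the sample dates.

```python
import numpy as np

# Compound dtype: one fixed-width string field and three 32-bit integer fields.
dt = np.dtype([("f0", "U10"), ("f1", "i4"), ("f2", "i4"), ("f3", "i4")])
records = np.array(
    [("01/23/2020", 32, 0, 2), ("01/31/2020", 436, 0, 10)],
    dtype=dt,
)

print(records.shape)        # (2,) : two records, not a 2x4 grid
print(records.dtype.names)  # ('f0', 'f1', 'f2', 'f3')
print(records["f1"])        # one field across all records
print(records[0]["f0"])     # one field of one record
```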
The main structured array doc page suggests using a recfunctions function to convert dtypes:
In [46]: import numpy.lib.recfunctions as rf
Unfortunately, the object field is giving that function problems:
In [48]: arr = rf.structured_to_unstructured(new_result, dtype=object)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [48], in <cell line: 1>()
----> 1 arr = rf.structured_to_unstructured(new_result, dtype=object)
File <__array_function__ internals>:5, in structured_to_unstructured(*args, **kwargs)
File ~\anaconda3\lib\site-packages\numpy\lib\recfunctions.py:980, in structured_to_unstructured(arr, dtype, copy, casting)
978 with suppress_warnings() as sup: # until 1.16 (gh-12447)
979 sup.filter(FutureWarning, "Numpy has detected")
--> 980 arr = arr.view(flattened_fields)
982 # next cast to a packed format with all fields converted to new dtype
983 packed_fields = np.dtype({'names': names,
984 'formats': [(out_dtype, dt.shape) for dt in dts]})
File ~\anaconda3\lib\site-packages\numpy\core\_internal.py:494, in _view_is_safe(oldtype, newtype)
491 return
493 if newtype.hasobject or oldtype.hasobject:
--> 494 raise TypeError("Cannot change data-type for object array.")
495 return
Let's try the dtype=None option (and clean up the string a bit):
In [49]: new_string = StringIO("01/23/2020, 32, 0, 2 \n01/31/2020 ,436 ,0 ,10")
...: new_result = np.genfromtxt(new_string, dtype=None, encoding="unicode"
...: , delimiter=",")
In [50]: new_result
Out[50]:
array([('01/23/2020', 32, 0, 2), ('01/31/2020 ', 436, 0, 10)],
dtype=[('f0', '<U11'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
This is the same as your case, except that the first field is now a string dtype instead of object.
But that doesn't help; it must be the target dtype that the function doesn't like (or both):
In [51]: arr = rf.structured_to_unstructured(new_result, dtype=object)
...
TypeError: Cannot change data-type for object array.
But we can convert the numeric fields, producing a 2d int array:
In [52]: arr = rf.structured_to_unstructured(new_result[['f1','f2','f3']], dtype=int)
In [53]: arr
Out[53]:
array([[ 32, 0, 2],
[436, 0, 10]])
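One way to build on that, sketched here as a self-contained script: convert only the numeric fields, then place them next to the string field in an object array. The array is built by hand rather than read with genfromtxt, and the field names and the 'U10' width are assumptions that mirror the session above.

```python
import numpy as np
import numpy.lib.recfunctions as rf

dt = np.dtype([("f0", "U10"), ("f1", "i4"), ("f2", "i4"), ("f3", "i4")])
records = np.array(
    [("01/23/2020", 32, 0, 2), ("01/31/2020", 436, 0, 10)],
    dtype=dt,
)

# Only the integer fields go through structured_to_unstructured.
numeric = rf.structured_to_unstructured(records[["f1", "f2", "f3"]], dtype=int)
print(numeric.shape)  # (2, 3)

# Combine the string column and the integer block in one object array.
combined = np.empty((len(records), 4), dtype=object)
combined[:, 0] = records["f0"]
combined[:, 1:] = numeric
print(combined)
```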
Assigning fields to object array
In [65]: new_string = "01/23/2020, 32, 0, 2 \n01/31/2020, 436 ,0 ,10".splitlines()
...: new_result = np.genfromtxt(new_string, dtype='O,i,i,i', encoding="unicode"
...: , delimiter=",")
In [66]: new_result
Out[66]:
array([(b'01/23/2020', 32, 0, 2), (b'01/31/2020', 436, 0, 10)],
dtype=[('f0', 'O'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
Create a target array:
In [67]: arr = np.empty((2,4),object)
In [68]: for i,f in enumerate(new_result.dtype.fields):
...:     arr[:,i] = new_result[f]
...:
In [69]: arr
Out[69]:
array([[b'01/23/2020', 32, 0, 2],
[b'01/31/2020', 436, 0, 10]], dtype=object)
Many of the recfunctions do something like this: create a target array and copy data by field name. Usually a structured array has many more records than fields, so this iteration by field is relatively efficient.
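The field-by-field copy can be wrapped in a small function so that it works for any number of fields. This is a sketch, not a numpy API: the name fields_to_object_array is made up, and it assumes a 1d structured array with no nested fields.

```python
import numpy as np

def fields_to_object_array(structured):
    """Copy each field of a 1d structured array into a column of a 2d object array.

    Hypothetical helper; assumes no nested (sub-structured) fields.
    """
    names = structured.dtype.names
    out = np.empty((structured.shape[0], len(names)), dtype=object)
    for col, name in enumerate(names):
        out[:, col] = structured[name]
    return out

records = np.array(
    [("01/23/2020", 32, 0, 2), ("01/31/2020", 436, 0, 10)],
    dtype=[("f0", "O"), ("f1", "i4"), ("f2", "i4"), ("f3", "i4")],
)
arr = fields_to_object_array(records)

expected = np.array([["01/23/2020", 32, 0, 2],
                     ["01/31/2020", 436, 0, 10]], dtype=object)
print((arr == expected).all())
```

Note that == on two arrays compares element by element, so the check in the question needs .all() (or np.array_equal) to collapse to a single True.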
unpack
If you specify unpack=True, the result is a separate array for each column/field:
In [74]: new_string = "01/23/2020, 32, 0, 2 \n01/31/2020, 436 ,0 ,10".splitlines()
...: new_result = np.genfromtxt(new_string, dtype='O,i,i,i', unpack=True
...: , delimiter=",")
In [75]: new_result
Out[75]:
[array([b'01/23/2020', b'01/31/2020'], dtype=object),
array([ 32, 436], dtype=int32),
array([0, 0], dtype=int32),
array([ 2, 10], dtype=int32)]
They can then be combined with np.stack:
In [77]: np.stack(new_result, axis=1)
Out[77]:
array([[b'01/23/2020', 32, 0, 2],
[b'01/31/2020', 436, 0, 10]], dtype=object)
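The first column in these results holds bytes (b'...'), which will not compare equal to the str values in the target array from the question. Here is a sketch of decoding that column before stacking, using hand-built arrays in place of the unpacked genfromtxt result and assuming the dates are plain ASCII:

```python
import numpy as np

# Stand-ins for the unpacked columns shown above.
dates_bytes = np.array([b"01/23/2020", b"01/31/2020"], dtype=object)
col1 = np.array([32, 436], dtype=np.int32)
col2 = np.array([0, 0], dtype=np.int32)
col3 = np.array([2, 10], dtype=np.int32)

# Decode each bytes value to str; ASCII is an assumption about the file.
dates = np.array([b.decode("ascii") for b in dates_bytes], dtype=object)

# Mixing an object array with integer arrays promotes the result to object.
stacked = np.stack([dates, col1, col2, col3], axis=1)
print(stacked.dtype)  # object
print(stacked.shape)  # (2, 4)
```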