I have been working on the IBTrACS dataset lately and would like to convert it to a 2D numpy array with the correct data types. I went through some filtering and selected the subset of data that I need, which is a 2D array with the following columns:
Column number - Data type
0 - integer (season)
1 - string (name)
2 - timestamp
3-4 - float-typed columns
5-20 - other integer-typed columns
I have also subsequently filled in empty values with placeholders: None (NaN) for floats and -99999 for integers. When I used astype to make numpy recognize the data types in the array, it apparently failed to process them column by column and tried to cast strings to integers even where there was no need to.
The following is an MCVE.
Code:
import numpy as np
import csv
from datetime import datetime
import pytz
# reading the dataset
with open('ibtracs.WP.list.v04r00.csv', 'r') as file:
    data = list(csv.reader(file, delimiter=','))
# remove CSV headers
ds = np.array(data[2:])
# selecting subsets of the data
mask_jtwc = ds[:,17] == 'jtwc_wp'
ds_jtwc = ds[mask_jtwc,:]
# remove unnecessary columns
columns_to_drop = [3,4] + list(range(8,13)) + [14,15,17,18,21,22,25] + list(range(38,161))
ds_jtwc = np.delete(ds_jtwc, columns_to_drop, 1)
# further filtering
mask_nature = ds_jtwc[:,5] == 'TS'
ds_jtwc = ds_jtwc[mask_nature,:]
mask_tracktype = ds_jtwc[:,6] == 'main'
ds_jtwc = ds_jtwc[mask_tracktype,:]
mask_iflag = [item[0] != '_' for item in ds_jtwc[:,7]]
ds_jtwc = ds_jtwc[mask_iflag,:]
# remove columns that helped us perform the last step but are no longer needed
columns_to_drop = [0,2,5,6,7]
ds_jtwc = np.delete(ds_jtwc, columns_to_drop, 1)
columns_to_drop = list(range(8, ds_jtwc.shape[1])) # representative columns only
ds_jtwc = np.delete(ds_jtwc, columns_to_drop, 1)
# manual processing to handle empty data and timestamps
dataset = ds_jtwc.tolist()
converted_set = []
for row in dataset:
    converted_row = []
    for i in range(len(row)):
        if i == 1:  # string type (name)
            converted_row.append(str(row[i]))
        elif i == 2:  # timestamp
            timestamp = datetime.strptime(row[i], '%Y-%m-%d %H:%M:%S')
            # timestamp = timestamp.replace(tzinfo=pytz.UTC)  # no need for timezones with modern numpy
            converted_row.append(timestamp)
        elif i == 3 or i == 4:  # float type
            if row[i] == " ":
                converted_row.append(None)  # NaN
            else:
                converted_row.append(float(row[i]))
        else:  # default to integers
            if row[i] == " ":
                converted_row.append(-99999)  # placeholder
            else:
                converted_row.append(int(row[i]))
    converted_set.append(converted_row)
dataset = np.array(converted_set)
# get sample data for reference
random_index = np.random.choice(dataset.shape[0], size=1, replace=False)
print("Sample data (row {0}):".format(random_index))
print(dataset[random_index, :])
print("Sample data (row 1):")
print(dataset[0])
### Code in question ###
print(dataset)
print(dataset.dtype)
dataset = dataset.astype([
('SEASON', 'i'),
('NAME', 'S'),
('ISO_TIME', 'datetime64[s]'),
('USA_LAT', 'f'),('USA_LON', 'f'),
('USA_WIND', 'i'),('USA_PRES', 'i'),
('USA_R34_NE', 'i')
])
print(dataset)
print(dataset.dtype)
Output:
Sample data (row [38692]):
[[1999 'MAGGIE' datetime.datetime(1999, 6, 8, 6, 0) 23.6 111.0 20 -99999
-99999]]
Sample data (row 1):
[1945 'ANN' datetime.datetime(1945, 4, 19, 12, 0) 9.5 160.3 25 -99999
-99999]
[[1945 'ANN' datetime.datetime(1945, 4, 19, 12, 0) ... 25 -99999 -99999]
[1945 'ANN' datetime.datetime(1945, 4, 19, 18, 0) ... 30 -99999 -99999]
[1945 'ANN' datetime.datetime(1945, 4, 20, 0, 0) ... 35 -99999 -99999]
...
[2019 'PHANFONE' datetime.datetime(2019, 12, 28, 12, 0) ... 25 1009
-99999]
[2019 'PHANFONE' datetime.datetime(2019, 12, 28, 18, 0) ... 20 1011
-99999]
[2019 'PHANFONE' datetime.datetime(2019, 12, 29, 0, 0) ... 20 1010
-99999]]
object
Traceback (most recent call last):
File "D:\path\Documents\Programming\path\Dataset\_forstackoverflow.py", line 64, in <module>
dataset = dataset.astype([
ValueError: invalid literal for int() with base 10: 'ANN'
If I do not perform the astype step, the data type turns out to be object, but I believe it will be more convenient later if the data already have the right types. I have also tried to specify sizes, but it gave me an identical error.
Code:
dataset = dataset.astype([
('SEASON', 'i4'),
('NAME', 'U16'),
('ISO_TIME', 'datetime64[s]'),
('USA_LAT', 'f'),('USA_LON', 'f'),
('USA_WIND', 'i4'),('USA_PRES', 'i4'),
('USA_R34_NE', 'i4')
])
I wonder what is wrong with or missing in my astype call. Thanks in advance!
CodePudding user response:
To illustrate my last comment:
In [9]: arr = np.array([[1,2,'word'],[3,4,'other']])
In [10]: arr
Out[10]:
array([['1', '2', 'word'],
['3', '4', 'other']], dtype='<U21')
In [11]: arr.astype('i,i,U10')
Traceback (most recent call last):
File "<ipython-input-11-3800d012c681>", line 1, in <module>
arr.astype('i,i,U10')
ValueError: invalid literal for int() with base 10: 'word'
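astype does not map the columns of the 2D array onto the fields of the compound dtype; it keeps the array's shape and tries to convert every single element to all of the fields, which is why int() ends up being handed 'word'. A purely numeric array makes that visible (a rough illustration; exact repr may vary with numpy version):
In [12]: np.array([[1, 2], [3, 4]]).astype('i,i')
Out[12]:
array([[(1, 1), (2, 2)],
       [(3, 3), (4, 4)]], dtype=[('f0', '<i4'), ('f1', '<i4')])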
But if I make a list of tuples:
In [14]: alist = [tuple(row) for row in arr]
In [15]: alist
Out[15]: [('1', '2', 'word'), ('3', '4', 'other')]
In [16]: np.array(alist, dtype='i,i,U10')
Out[16]:
array([(1, 2, 'word'), (3, 4, 'other')],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<U10')])
or
In [17]: import numpy.lib.recfunctions as rf
In [19]: rf.unstructured_to_structured(arr, np.dtype('i,i,U10'))
Out[19]:
array([(1, 2, 'word'), (3, 4, 'other')],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<U10')])
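Applied to your dataset, that means skipping the intermediate object array and the later astype altogether: build converted_set as a list of tuples and hand the named dtype straight to np.array. A minimal sketch, assuming the conversion loop appends np.nan instead of None for the missing floats (None cannot be cast to a float field), with the field names taken from your question:
import numpy as np
dt = np.dtype([
    ('SEASON', 'i4'),
    ('NAME', 'U16'),
    ('ISO_TIME', 'datetime64[s]'),
    ('USA_LAT', 'f4'), ('USA_LON', 'f4'),
    ('USA_WIND', 'i4'), ('USA_PRES', 'i4'),
    ('USA_R34_NE', 'i4'),
])
# converted_set is the output of your conversion loop, one tuple per row;
# Python datetime objects are converted to datetime64[s] automatically
records = [tuple(row) for row in converted_set]
dataset = np.array(records, dtype=dt)
# columns are now addressed by field name, each with its own dtype
print(dataset['ISO_TIME'].dtype)
print(dataset['USA_WIND'].min())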