Keep only numerical information from numpy array of strings

I want to keep only the numbers from a numpy array of strings, not all of which are valid numbers. My code looks like the following:

age = train['age'].to_numpy() # 200k values
set(age)
# {'1', '2', '3', '7-11', np.nan...} 

age  = np.array(['1', '2', '3', '7-11', np.nan])

Desired output: np.array([1, 2, 3]). Ideally '7-11' would become 7, but that's not simple, so losing it is tolerable.

np.isfinite(x) gives "ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''"

x = [num for num in age if isinstance(num, (int, float))] returns []

CodePudding user response：

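You could do something like the following. It relies on age being a string array, as in the np.array([...]) example: every value written back is stored as a string, so np.nan ends up as 'nan':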

for pos, val in enumerate(age):
    try:
        new_val = int(val)
    except ValueError:  # raised for '7-11' and 'nan'
        new_val = np.nan
    age[pos] = new_val

# Drop the 'nan' placeholders, then convert the remaining strings to integers
age = age[age != "nan"].astype(int)

print(age)
> [1 2 3]
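
The same idea can be sketched without writing back into the array, which also makes it work when the values arrive as an object array (for example straight from train['age'].to_numpy(), where missing values are float NaN rather than the string 'nan'). The helper name keep_ints is made up for illustration; it is a minimal sketch, not the only way to do this.

```python
import numpy as np

def keep_ints(values):
    """Return an int array of the entries that int() can parse (hypothetical helper)."""
    kept = []
    for val in values:
        try:
            kept.append(int(val))
        except (ValueError, TypeError):
            # '7-11', 'nan' and float NaN raise ValueError; None raises TypeError
            continue
    return np.array(kept, dtype=int)

age = np.array(['1', '2', '3', '7-11', np.nan])
print(keep_ints(age))  # [1 2 3]
```

One caveat: int() truncates real floats, so a value such as 2.7 in an object array would be kept as 2.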

CodePudding user response：

Here's an option that will split strings on '-' first, and only take the first value, so '7-11' is converted to 7:

age = np.array(['1', '2', '3', '7-11', np.nan])
age_int = np.array([int(x[0]) for x in np.char.split(age, sep='-') if x[0].isdecimal()])

Output: array([1, 2, 3, 7])

If you don't care about cases like '7-11', there is a more efficient way that avoids the explicit Python loop:

age_int2 = age[np.char.isdecimal(age)].astype(int)

Output2: array([1, 2, 3])
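
Since the data in the question comes from a pandas column (train['age']), the filtering could also be done before converting to numpy. The following is a sketch under that assumption, using a small stand-in Series instead of the real train DataFrame. pd.to_numeric with errors='coerce' drops anything unparseable, and str.extract with a leading-digits regex is one way to turn '7-11' into 7.

```python
import numpy as np
import pandas as pd

# Stand-in for train['age']; the real column is assumed to look like this
age = pd.Series(['1', '2', '3', '7-11', np.nan])

# errors='coerce' turns anything unparseable ('7-11', NaN) into NaN
numeric = pd.to_numeric(age, errors='coerce')
age_int = numeric.dropna().astype(int).to_numpy()
print(age_int)  # [1 2 3]

# Regex variant: capture the leading run of digits, so '7-11' becomes 7
leading = age.str.extract(r'^(\d+)', expand=False)
age_int_first = leading.dropna().astype(int).to_numpy()
print(age_int_first)  # [1 2 3 7]
```

Both variants return a plain integer numpy array, so the rest of the code can stay as it is.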