Deleting characters in arrays while converting rows to integer in DataFrame-CodePudding

I have a dataframe like below

Each row has a different length.

I'm trying to run the code below

for i in range(len(df)):
  df['ColumnA'][i] = df['ColumnA'][i].astype(int)

I'm getting an error because some rows have like "64B", "64A" string values.

For example,

df['ColumnA'][37]

Output:

array(['34', '35', '36', '38', '39', '40', '41', '56', '58', '59', '60',
       '61', '62', '62A', '62B', '63', '64', '65', '88', '90', '94', '98'],
      dtype='<U3')

You can easily see the '62A', '62B' string values.

My goal is to convert all strings to int (e.g from 62A to 62). Deleting the characters next to it.

CodePudding user response：

you can do this by using list comprehension too:

import re
import pandas as pd
import numpy as np

df = pd.DataFrame({'colA': [np.array(['34A', '35', '36A', '38', '39', '40', '41', '56', '58', '59', '60',
       '61', '62', '62A', '62B', '63', '64', '65', '88', '90', '94', '98'],
       dtype='<U3') for _ in range(10)]})

print('before : \n', df)

df['colA'] = [[int(re.sub('[^0-9]','', el)) for el in row] for row in df['colA']]

print('after : \n', df)

output:

before : 
                                                 colA
0  [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
1  [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
2  [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
3  [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
4  [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
5  [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
6  [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
7  [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
8  [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
9  [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
after :
                                                 colA
0  [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
1  [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
2  [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
3  [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
4  [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
5  [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
6  [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
7  [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
8  [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
9  [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...

CodePudding user response：

You can use the sub function in the builtin regex module to replace every non-digit character (\D selects non-digits in a regex) with a blank "".

A list comprehension will let you apply that same method to every item in the list.

The last component you need is the apply method in pandas that will apply the same function to every item in a series.

import re

import pandas as pd


# sample data
df = pd.Series([
    [61, '62a', 70, 'z8z8z'],
    [61, '62hello', 70],
], name='columnA').to_frame()

# show sample data
df
#>                 columnA
#> 0  [61, 62a, 70, z8z8z]
#> 1     [61, 62hello, 70]

# select the column and apply the function to each element
df.columnA.apply(
    lambda row: [
        int(re.sub("\D", "", str(x)))  # cast to string, replace any non-digit character, cast to int
        for x in row  # for every item in the list
    ]
)
#> 0    [61, 62, 70, 88]
#> 1        [61, 62, 70]
#> Name: columnA, dtype: object

This will result in a list stored as each item in the Series "columnA". If you want these to be numpy arrays instead, you can wrap the list comprehension in a call to np.array.