I have a dataframe like below
Each row has a different length.
I'm trying to run the code below
for i in range(len(df)):
df['ColumnA'][i] = df['ColumnA'][i].astype(int)
I'm getting an error because some rows have like "64B", "64A" string values.
For example,
df['ColumnA'][37]
Output:
array(['34', '35', '36', '38', '39', '40', '41', '56', '58', '59', '60',
'61', '62', '62A', '62B', '63', '64', '65', '88', '90', '94', '98'],
dtype='<U3')
You can easily see the '62A', '62B' string values.
My goal is to convert all strings to int (e.g from 62A to 62). Deleting the characters next to it.
CodePudding user response:
you can do this by using list comprehension too:
import re
import pandas as pd
import numpy as np
df = pd.DataFrame({'colA': [np.array(['34A', '35', '36A', '38', '39', '40', '41', '56', '58', '59', '60',
'61', '62', '62A', '62B', '63', '64', '65', '88', '90', '94', '98'],
dtype='<U3') for _ in range(10)]})
print('before : \n', df)
df['colA'] = [[int(re.sub('[^0-9]','', el)) for el in row] for row in df['colA']]
print('after : \n', df)
output:
before :
colA
0 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
1 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
2 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
3 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
4 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
5 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
6 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
7 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
8 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
9 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
after :
colA
0 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
1 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
2 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
3 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
4 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
5 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
6 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
7 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
8 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
9 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
CodePudding user response:
You can use the sub
function in the builtin regex module to replace every non-digit character (\D
selects non-digits in a regex) with a blank ""
.
A list comprehension will let you apply that same method to every item in the list.
The last component you need is the apply
method in pandas that will apply the same function to every item in a series.
import re
import pandas as pd
# sample data
df = pd.Series([
[61, '62a', 70, 'z8z8z'],
[61, '62hello', 70],
], name='columnA').to_frame()
# show sample data
df
#> columnA
#> 0 [61, 62a, 70, z8z8z]
#> 1 [61, 62hello, 70]
# select the column and apply the function to each element
df.columnA.apply(
lambda row: [
int(re.sub("\D", "", str(x))) # cast to string, replace any non-digit character, cast to int
for x in row # for every item in the list
]
)
#> 0 [61, 62, 70, 88]
#> 1 [61, 62, 70]
#> Name: columnA, dtype: object
This will result in a list stored as each item in the Series "columnA". If you want these to be numpy arrays instead, you can wrap the list comprehension in a call to np.array
.