I'm struggling to find a solution to this problem which is why I'm here.
I have a dataframe column num_list that contains letters and numbers:
df['num_list']
0 "8E"
1 "5E"
2 "19A"
3 "16E"
4 "26D"
...
539032 "5E"
539033 "6E"
539034 "16E"
539035 "7E"
539036 "5E"
Name: carweb_abi2_50, Length: 539037, dtype: object
I want to remove all the letters and quotation marks. I've managed the letters part getting to here:
0 8
1 5
2 19
3 16
4 26
..
Name: carweb_abi2_50, Length: 539037, dtype: object
However, I can't convert to integer and when I check the unique elements for the column I see this:
array(['8', '5', '19', '16', '26', '24', '15', '14', '6', '28', '18',
'20', '7', '41', '25', '31', '17', '9', '12', '4', '23', '10',
'27', '40', '30', '3', '21', '13', '22', '11', '33', '42', '34',
'32', '36', '1', '2', '39', '', '29', '37', 0, '38', '43', '35',
'45', '44', '47', '46', '49', '48', '50', '0'], dtype=object)
This shows that the NaN values I replaced with zero are the actual integer 0, while every other value is still a string (hence the quotes), and there is also an empty string '' in the column.
I've tried extracting only the integers into a new column but no luck.
TYIA
CodePudding user response:
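You can strip every non-digit character with a regex. Cast to str first, because the column also holds a real integer 0 and .str methods return NaN for non-string values:
df["num_list"] = df["num_list"].astype(str).str.replace(r'\D+', '', regex=True)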
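and then convert to integer. The unique values in the question include an empty string, which makes astype(int) raise a ValueError, so replace it with '0' first:
df["num_list"] = df["num_list"].replace('', '0').astype(int)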
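If it helps, here is a minimal end-to-end sketch of the same idea. The sample data is made up to mirror the values shown in the question (quoted strings, an empty string, an integer 0 and a NaN), and it swaps in str.extract plus pd.to_numeric instead of str.replace, which is just one possible way to handle the empty and missing entries. Treating those entries as 0 is an assumption based on what the question describes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the mixed values in the question
df = pd.DataFrame({"num_list": ['"8E"', '"5E"', '"19A"', "", 0, np.nan, '"16E"']})

# Cast everything to str, then pull out the first run of digits.
# Rows without any digit (empty string, NaN) come back as NaN.
digits = df["num_list"].astype(str).str.extract(r"(\d+)", expand=False)

# Convert to numbers, treat missing entries as 0 (assumption), cast to int
df["num_list"] = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

print(df["num_list"].tolist())  # [8, 5, 19, 0, 0, 0, 16]
```

If you would rather keep missing values as missing instead of 0, dropping the fillna(0) and casting with astype("Int64") (pandas' nullable integer dtype) should work as well.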