Unicode issues with the list. Unable to resolve it in Python

I am extracting a table from a website into a DataFrame with pandas in the following manner:

import pandas as pd

jockeys = 'https://race.kra.co.kr/globalEn/jockeysBusan.do'
jdf = pd.read_html(jockeys)[0]

jdf_list = jdf.values.tolist()
print(jdf_list)

This is the result I get (only the first few rows are shown):

[[1,
  'Chae Sang Hyun',
  'FREE',
  '2014/06/05',
  '262 (16/19/22)',
  '1789 (130/153/162)'],
 [2,
  'Choi Eun Gyeong',
  'FREE',
  '2016/06/18',
  '317 (19/22/38)',
  '1522 (90/120/140)'],
 [3,
  'Choi Si Dae',
  'FREE',
  '2007/05/18',
  '409 (58/34/34)',
  '5649 (750/658/594)'],
 [4,
  'Francisco Da Silva',
  'FREE',
  '2016/09/02',
  '375 (61/45/42)',
  '2255 (309/300/261)'],
 [5,
  '(-4)\xa0Gwon O Chan',
  'FREE',
  '2021/07/15',
  '154 (4/12/10)',
  '200 (4/14/10)']]

I keep getting "(-4)\xa0" before some of the names. I have tried the following techniques, but in vain:

jdf_list_new =  jdf_list.encode('ascii', 'ignore').decode('utf-8')

and

jdf_list_new = unicodedata.normalize("NFKC", jdf_list)

Need help here!

CodePudding user response：

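\xa0 is the Unicode character NO-BREAK SPACE (U+00A0). Both attempts in the question are applied to the list itself, but encode and normalize work on strings, so apply the encode/decode step to the column in the DataFrame before converting it to a list. Encoding to ASCII with errors set to 'ignore' drops the \xa0 character. Note that (-4) is part of the table on the website, so it stays in the name.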

jdf = pd.read_html(jockeys)[0]
jdf['(allowance)Name'] = jdf['(allowance)Name'].str.encode('ascii', 'ignore').str.decode('utf-8')

Output

[1, 'Chae Sang Hyun', 'FREE', '2014/06/05', '262 (16/19/22)', '1789 (130/153/162)']
[2, 'Choi Eun Gyeong', 'FREE', '2016/06/18', '317 (19/22/38)', '1522 (90/120/140)']
[3, 'Choi Si Dae', 'FREE', '2007/05/18', '409 (58/34/34)', '5649 (750/658/594)']
[4, 'Francisco Da Silva', 'FREE', '2016/09/02', '375 (61/45/42)', '2255 (309/300/261)']
[5, '(-4)Gwon O Chan', 'FREE', '2021/07/15', '154 (4/12/10)', '200 (4/14/10)']
[6, 'Jeon Jin Gu', 'FREE', '2017/06/02', '183 (2/12/7)', '914 (47/64/52)']
[7, 'Jeong Dong Cheol', 'FREE', '2011/08/24', '141 (6/3/4)', '2724 (169/183/195)']
[8, 'Jeong Woo Ju', 'FREE', '2018/06/14', '143 (2/6/8)', '987 (48/50/68)']
[9, 'Jo In Kwon', 'FREE', '2008/06/18', '355 (37/53/40)', '4592 (649/533/491)']
[10, 'Jung Do Yun', 'FREE', '2016/06/18', '260 (29/28/25)', '1921 (162/157/194)']
[11, 'Kim Cheol Ho', 'FREE', '2008/06/18', '164 (8/8/14)', '2640 (217/219/240)']
[12, 'Kim Eu Soo', 'FREE', '2005/05/04', '270 (11/13/14)', '4102 (243/306/344)']
[13, 'Kim Hye Sun', 'FREE', '2009/06/01', '415 (46/57/44)', '4275 (350/374/363)']
[14, '(-4)Lee Hong Rag', 'FREE', '2022/07/01', '91 (6/9/8)', '91 (6/9/8)']
[15, 'Lee Sung Jae', 'FREE', '2008/05/14', '396 (34/23/35)', '4244 (327/333/398)']
[16, 'Lim Sung Sil', 'FREE', '2002/09/13', '94 (5/8/14)', '2648 (353/296/279)']
[17, 'Mo Jun Ho', 'FREE', '2020/07/15', '340 (17/17/26)', '755 (45/54/64)']
[18, 'Park Jae I', 'FREE', '2015/06/17', '390 (62/52/50)', '2239 (167/223/227)']
[19, '(-4)Park Jong Ho', 'FREE', '2020/07/15', '74 (1/2/5)', '282 (8/7/14)']
[20, '(-2)Seo Gang Ju', 'FREE', '2021/07/15', '342 (28/41/40)', '385 (28/44/46)']
[21, 'Seo Seung Un', 'FREE', '2011/08/24', '368 (61/55/46)', '3973 (620/540/491)']
[22, '(-2)Shin Yun Seob', 'FREE', '2021/07/15', '313 (16/22/28)', '407 (24/26/38)']
[23, 'Song Kyeong Yun', 'FREE', '2007/05/18', '391 (39/34/40)', '4765 (361/450/461)']
[24, '(-3)Yoon Hyung Seok', 'FREE', '2021/07/15', '268 (13/19/23)', '317 (14/24/24)']
[25, 'You Hyun Myung', 'FREE', '2002/09/13', '387 (73/49/42)', '7104 (1199/940/750)']
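
The unicodedata attempt from the question can also be made to work if it is applied to each string rather than to the whole list. A minimal sketch, assuming a regular space is an acceptable replacement for the no-break space; clean_cell and rows are hypothetical names, and the two sample rows are copied from the question's output:

```python
import unicodedata

def clean_cell(value):
    """Normalize strings with NFKC; leave non-string cells (e.g. ints) unchanged."""
    if isinstance(value, str):
        # NFKC maps the no-break space U+00A0 to a regular space U+0020.
        return unicodedata.normalize("NFKC", value)
    return value

# Two sample rows copied from the question's output.
rows = [
    [1, 'Chae Sang Hyun', 'FREE', '2014/06/05', '262 (16/19/22)', '1789 (130/153/162)'],
    [5, '(-4)\xa0Gwon O Chan', 'FREE', '2021/07/15', '154 (4/12/10)', '200 (4/14/10)'],
]

# Apply the cleaner to every cell of every row.
cleaned = [[clean_cell(cell) for cell in row] for row in rows]
print(cleaned[1][1])  # (-4) Gwon O Chan
```

Unlike the ASCII encode/decode approach, this keeps a space between the allowance and the name instead of dropping the character.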

CodePudding user response：

I couldn't solve it with encoding and decoding either. To remove "(-4)\xa0" from a column in pandas, you can use the apply method of the column (a Series) to run a function on each value. The function uses the string's replace method to replace each occurrence of "(-4)\xa0" with an empty string.

import pandas as pd

# Load the dataset
df = pd.read_csv('data.csv')

# Define a function to remove (-4)\xa0 from a string
def remove_string(s):
    return s.replace("(-4)\xa0", "")

# Apply the function to the dataset
df['column_name'] = df['column_name'].apply(remove_string)

In this example, data.csv and column_name are placeholders for the file that contains the dataset and the column you want to modify. The apply method calls remove_string on each value in that column, and the modified values are assigned back to the same column.
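
Note that remove_string only matches the literal "(-4)\xa0", while the table also contains other allowances such as (-2) and (-3). A possible generalization is sketched below using a regular expression. It assumes the prefix is always a parenthesized integer followed by a no-break space; the small DataFrame is a hypothetical stand-in for the scraped table, the "(-2)\xa0" value is an assumed raw form, and the new "Name" column is an illustrative choice:

```python
import pandas as pd

# Small stand-in for the scraped table (hypothetical sample, not the live data).
df = pd.DataFrame({
    "(allowance)Name": [
        "Chae Sang Hyun",
        "(-4)\xa0Gwon O Chan",
        "(-2)\xa0Seo Gang Ju",
    ],
})

# Assumed pattern: "(", optional minus sign, digits, ")", then any whitespace.
# For str patterns, \s also matches the no-break space U+00A0.
prefix = r"^\(-?\d+\)\s*"

# Strip the allowance prefix from every name in one vectorized call.
df["Name"] = df["(allowance)Name"].str.replace(prefix, "", regex=True)
print(df["Name"].tolist())  # ['Chae Sang Hyun', 'Gwon O Chan', 'Seo Gang Ju']
```

Writing the result to a separate column keeps the original values available in case the allowance is needed later.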