Pandas Regex to Remove Preceding Characters from Column Names

The output of the Pandas DataFrame produced by the following code:

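from io import StringIO

import pandas as pd
import requests

payload = {}
files = {}
headers = {
  'Accept': 'text/csv',
  'Authorization': 'Bearer ' + token,  # token is defined elsewhere
}

# request_dic maps a key to an endpoint path and is defined elsewhere
for k in request_dic.keys():
  base_url = "https://feeds.myfeed.com/api/"
  url = base_url + request_dic[k]
  print(url)

  response = requests.request("GET", url, headers=headers, data=payload, files=files)

  # Parse the pipe-delimited response body into a DataFrame
  dt = pd.read_csv(StringIO(response.text), sep="|", encoding='base64')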

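Is a table whose first column name carries a stray prefix, for example Ã¯Â»Â¿COUNTRY ID.

Can someone help with a regex that will remove Ã¯Â»Â¿ from the column names?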

CodePudding user response：

Something like this, maybe:

import re
for k in request_dic.keys():
    base_url = "https://feeds.myfeed.com/api/"
    url = base_url + request_dic[k]
    print(url)

    response = requests.request("GET", url, headers=headers, data=payload, files=files)


    dt = pd.read_csv(StringIO(response.text),sep="|", encoding='base64')

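    # Keep each column name from its first uppercase ASCII letter onward,
    # which drops the stray prefix; assumes every name contains such a letter.
    for col in dt.columns:
        dt.rename({col: re.findall('([A-Z].*)', col)[0]}, inplace=True, axis=1)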
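To try the idea without hitting the API, here is a minimal sketch on a made-up two-row frame. The sample data and the helper name from_first_capital are assumptions for illustration only. It uses re.search instead of re.findall(...)[0], so a column name with no uppercase ASCII letter is left unchanged rather than raising an IndexError.

```python
import re

import pandas as pd


def from_first_capital(name: str) -> str:
    """Return name from its first uppercase ASCII letter onward.

    Names without any uppercase ASCII letter are returned unchanged.
    """
    match = re.search(r"[A-Z].*", name)
    return match.group(0) if match else name


# Hypothetical sample standing in for the frame read from the feed
dt = pd.DataFrame(
    {"Ã¯Â»Â¿COUNTRY ID": [10, 10007], "COUNTRY NAME": ["Greece", "Romania"]}
)

# rename accepts a function that is applied to every column label
dt = dt.rename(columns=from_first_capital)
print(list(dt.columns))
```

Note that this relies on the real column names starting with an uppercase ASCII letter, so it would also strip a legitimate lowercase or numeric prefix.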

CodePudding user response：

"".join([ch for ch in "Ã¯Â»Â¿COUNTRY ID" if str.isascii(ch)]).strip()

I prefer this approach. Use something like it in the rename method, as @SuperStew does.
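As a rough sketch of how that could look with rename (the sample frame and the helper name strip_non_ascii are made up for illustration, not taken from the question):

```python
import pandas as pd


def strip_non_ascii(name: str) -> str:
    """Drop every non-ASCII character, then trim surrounding whitespace."""
    return "".join(ch for ch in name if ch.isascii()).strip()


# Hypothetical sample standing in for the frame read from the feed
dt = pd.DataFrame(
    {"Ã¯Â»Â¿COUNTRY ID": [10, 10007], "COUNTRY NAME": ["Greece", "Romania"]}
)

# Apply the filter to every column label
dt = dt.rename(columns=strip_non_ascii)
print(list(dt.columns))
```

str.isascii needs Python 3.7 or later. This also removes non-ASCII characters that legitimately belong to a column name, not just the prefix.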

CodePudding user response：

Since you specifically ask for a regex, the following line removes every character that is neither (^) an upper- or lowercase ASCII letter (A-Za-z) nor whitespace (\s).

dt.columns = dt.columns.str.replace(r'[^A-Za-z\s]', '', regex=True)

If you have non-ASCII characters in your regular column names, you might need to adjust the regex. If you also need numbers you could add 0-9 to the regex.
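Two possible adjustments, sketched on a made-up set of column names (the names and both variable names are assumptions for illustration). The first keeps digits as well. The second only strips a leading run of non-ASCII characters, so accented letters later in a name survive.

```python
import pandas as pd

# Hypothetical column names, including a digit and a legitimate accented letter
columns = pd.Index(["Ã¯Â»Â¿COUNTRY ID", "COUNTRY NAME", "REGION 2", "CAFÉ COUNT"])

# Variant 1: keep ASCII letters, digits and whitespace; drop everything else
with_digits = columns.str.replace(r"[^A-Za-z0-9\s]", "", regex=True)

# Variant 2: remove only non-ASCII characters at the very start of the name
prefix_only = columns.str.replace(r"^[^\x00-\x7F]+", "", regex=True)

print(list(with_digits))
print(list(prefix_only))
```

Variant 1 still drops the É from the last name, while variant 2 leaves it alone because it is not at the start. regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default.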

Result:

	COUNTRY ID	COUNTRY NAME
0	10	Greece
1	10007	Romania