The original output looks like this:
JOBS column:
{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}
{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}
{"/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}
I want to remove the word "Job" and the associated codes, so that the column looks like this:
New JOBS column:
{"Waitress", "Programmer", "Marketing"}
{"Waitress", "Programmer", "Marketing"}
{"Programmer", "Marketing"}
Before using the regex, I converted the JOBS column into a list (df_old) and tried this:
df_new = [re.sub('^/j/', '', doc) for doc in df_old]
This raised an error: TypeError: expected string or bytes-like object
So I tried this instead:
df_new = [re.sub('^/j/', '', doc) for doc in str(df_old)]
That ran without errors, but the output was unusable and nowhere near what I wanted.
I hope you can help. Thank you in advance.
CodePudding user response:
As per the comment, there are far better ways of doing this. However, here is a rough example that directly answers the question asked:
import pandas as pd
data = ['{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}',
'{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}',
'{"/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}']
df = pd.DataFrame(data, columns=['JOBS'])
df['Cleaned_JOBS'] = df['JOBS'].str.findall(r': (\".*?\sJob\"),?').str.join(', ')
df['Cleaned_JOBS'] = df['Cleaned_JOBS'].str.replace(' Job', '')
df['Cleaned_JOBS'] = '{' + df['Cleaned_JOBS'] + '}'
print(df, '\n\n')
Output:
                                                JOBS                             Cleaned_JOBS
0  {"/j/03k50": "Waitress Job", "/j/055qm": "Prog...  {"Waitress", "Programmer", "Marketing"}
1  {"/j/03k50": "Waitress Job", "/j/055qm": "Prog...  {"Waitress", "Programmer", "Marketing"}
2  {"/j/055qm": "Programmer Job", "/j/02h40lc": "...              {"Programmer", "Marketing"}
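One of the "better ways" hinted at above might be to parse each cell instead of pattern-matching it. The sketch below is only an illustration and rests on an assumption the question does not confirm: that every cell is either a dict already or a string holding a valid JSON object. The helper name clean_jobs is hypothetical. It keeps the values, drops the /j/... keys, and strips a trailing "Job" from each title.

```python
import json
import re

import pandas as pd


def clean_jobs(cell):
    """Return the job titles from one cell, without codes or a trailing 'Job'."""
    # Assumption: the cell is a dict, or a string containing a valid JSON object.
    mapping = json.loads(cell) if isinstance(cell, str) else cell
    # Keep only the values (the titles) and strip a trailing " Job", ignoring case.
    return [re.sub(r'\s+job$', '', title, flags=re.IGNORECASE)
            for title in mapping.values()]


df = pd.DataFrame({'JOBS': [
    '{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}',
    '{"/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}',
]})

# apply() calls clean_jobs once per cell, so re.sub always receives a string.
df['Cleaned_JOBS'] = df['JOBS'].apply(clean_jobs)
print(df['Cleaned_JOBS'].tolist())
```

This also suggests why the attempts in the question went wrong: re.sub needs a string, so it raises TypeError when the list elements are dicts or other non-string objects, while iterating over str(df_old) walks through the text one character at a time rather than one row at a time.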