regex : how to keep relevant words and remove other?

Time:11-17

The original output looks like this:

JOBS column:

{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}

{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}

{"/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}

I want something like this, with the word "Job" and the associated codes removed:

New JOBS column:

{"Waitress", "Programmer", "Marketing"}

{"Waitress", "Programmer", "Marketing"}

{"Programmer", "Marketing"}

Before using the regex, I converted the column Jobs into a list (df_old) and I tried this:

df_new = [re.sub('^/j/', '', doc) for doc in df_old]

I got an error, TypeError: expected string or bytes-like object, so I tried this instead:

df_new = [re.sub('^/j/', '', doc) for doc in str(df_old)]

I got no errors, but the output was unusable and did not get me any closer to my goal.

I hope you can help. Thank you in advance.

CodePudding user response:

As noted in the comment, there are far better ways of doing this. However, here is a rough example that directly addresses the question asked:

import pandas as pd

data = ['{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}',
'{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}',
'{"/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}']

df = pd.DataFrame(data, columns=['JOBS'])

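# Capture each quoted value that ends in " Job", then join the matches into one string.
df['Cleaned_JOBS'] = df['JOBS'].str.findall(r': (\".*?\sJob\"),?').str.join(', ')
# Drop the literal " Job" suffix from every captured value.
df['Cleaned_JOBS'] = df['Cleaned_JOBS'].str.replace(' Job', '', regex=False)

# Wrap the result in braces to match the requested format.
df['Cleaned_JOBS'] = '{' + df['Cleaned_JOBS'] + '}'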

print(df, '\n\n')

Output:

    JOBS                                                Cleaned_JOBS
0   {"/j/03k50": "Waitress Job", "/j/055qm": "Prog...   {"Waitress", "Programmer", "Marketing"}
1   {"/j/03k50": "Waitress Job", "/j/055qm": "Prog...   {"Waitress", "Programmer", "Marketing"}
2   {"/j/055qm": "Programmer Job", "/j/02h40lc": "...   {"Programmer", "Marketing"}
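
One of the "better ways" hinted at above might be to parse each cell instead of pattern-matching the raw text. The following is only a sketch, and it assumes every cell is either a valid JSON string or already a dict, which the question does not confirm. The helper name clean_jobs is made up for illustration, and df_old is the list from the question. It returns a list of titles per row rather than the brace-wrapped string.

```python
import json
import re

# Sample rows copied from the question; each entry is assumed to be a JSON string.
df_old = [
    '{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}',
    '{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}',
    '{"/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}',
]


def clean_jobs(cell):
    """Hypothetical helper: return the job titles without the trailing ' Job'."""
    # Accept JSON strings as well as cells that are already dicts.
    mapping = json.loads(cell) if isinstance(cell, str) else cell
    # The "/j/..." codes are the keys, so taking only the values discards them.
    return [re.sub(r'\s+Job$', '', title) for title in mapping.values()]


df_new = [clean_jobs(cell) for cell in df_old]
print(df_new)
# [['Waitress', 'Programmer', 'Marketing'], ['Waitress', 'Programmer', 'Marketing'], ['Programmer', 'Marketing']]
```

On a DataFrame the same helper could be applied with df['JOBS'].apply(clean_jobs). Note that iterating over str(df_old), as in the second attempt in the question, walks the text one character at a time, which is probably why that output looked wrong.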