Remove special characters from object type columns of Data Frame and combine all string letter to fo-CodePudding

enter image description here

I have this column in my dataset. Tried to solve it by removing [ and { symbols from this as the whole column is a type string. Each character is a string including , : and spaces. while removing {. it only removes starting and ending not the in-between.

I want to extract only name values like 'Comedy', 'Thriller', ect.

CodePudding user response：

import numpy as np
import pandas as pd
import re

df = pd.DataFrame([["[{'id': 35, 'name': 'Comedy'}]", np.NaN], ["[{'id': 35, 'name': 'Comedy'}]", np.NaN]], columns=['genres', 'homepage'])

df['genres'] = df.apply(lambda x: re.search(r"'name':\s*'(.*)'", x['genres'], re.DOTALL).group(1), axis=1)

print(df)

Result:

   genres  homepage
0  Comedy       NaN
1  Comedy       NaN

UPDATE:

If you have rows that contains more than a genre, you can extract all genres using this:

import numpy as np
import pandas as pd
import json

def get_genres(string):
    string = string.replace("\'", "\"")
    j = json.loads(string)
    genres = [obj['name'] for obj in j]
    return ','.join(genres) # use `return genres[0]` if you want only the first genre found

df = pd.DataFrame([["[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]", np.NaN], ["[{'id': 35, 'name': 'Comedy'}]", np.NaN]], columns=['genres', 'homepage'])
df['genres'] = df['genres'].apply(get_genres)
print(df)

Result:

         genres  homepage
0  Comedy,Drama       NaN
1        Comedy       NaN

CodePudding user response：

def convt(str):
    return eval(str)


df['genre'] = df['genre'].apply(convt)