Remove special characters from object type columns of Data Frame and combine all string letter to fo

Time:09-08


I have this column in my dataset. I tried to clean it by removing the [ and { symbols, since the whole column is of type string. Each value is a single string that includes the , and : characters and the spaces. When I remove {, only the characters at the start and end are removed, not the ones in between.

I want to extract only the name values, like 'Comedy', 'Thriller', etc.

CodePudding user response:

import numpy as np
import pandas as pd
import re

df = pd.DataFrame([["[{'id': 35, 'name': 'Comedy'}]", np.nan], ["[{'id': 35, 'name': 'Comedy'}]", np.nan]], columns=['genres', 'homepage'])

# capture the text between the quotes that follow 'name': ([^']* stops at the closing quote)
df['genres'] = df['genres'].apply(lambda s: re.search(r"'name':\s*'([^']*)'", s).group(1))

print(df)

Result:

   genres  homepage
0  Comedy       NaN
1  Comedy       NaN
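
The same regex idea can be stretched to rows that hold several genres. This is only a sketch, assuming every name is wrapped in single quotes and contains no apostrophe of its own; the sample rows are made up for illustration, and `Series.str.findall` is used in place of `re.search` so that every match in a row is collected:

```python
import pandas as pd

# hypothetical sample rows in the same shape as the question's column
df = pd.DataFrame({
    "genres": [
        "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
        "[{'id': 35, 'name': 'Comedy'}]",
    ]
})

# findall returns, per row, a list with every captured name;
# [^']* stops at the next single quote, so each match stays inside one dict
names = df["genres"].str.findall(r"'name':\s*'([^']*)'")

# join each list into one comma-separated string
df["genres"] = names.str.join(",")
print(df)
```

If a name can contain an apostrophe, this pattern would cut it short, so parsing the string as a data structure is likely the safer route.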

UPDATE:

If you have rows that contain more than one genre, you can extract all genres using this:

import numpy as np
import pandas as pd
import json

def get_genres(string):
    string = string.replace("\'", "\"")
    j = json.loads(string)
    genres = [obj['name'] for obj in j]
    return ','.join(genres) # use `return genres[0]` if you want only the first genre found

df = pd.DataFrame([["[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]", np.nan], ["[{'id': 35, 'name': 'Comedy'}]", np.nan]], columns=['genres', 'homepage'])
df['genres'] = df['genres'].apply(get_genres)
print(df)

Result:

         genres  homepage
0  Comedy,Drama       NaN
1        Comedy       NaN

CodePudding user response:

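import ast

def convt(value):
    # literal_eval parses the string into a list of dicts without executing arbitrary code
    return ast.literal_eval(value)

df['genre'] = df['genre'].apply(convt)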
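
Converting the strings gives Python lists of dicts, not the names themselves, so one more step is needed to get what the question asks for. A minimal sketch of that step, assuming the column is called `genre` as above and may contain missing values; `extract_names`, the `genre_names` column and the sample rows are hypothetical, and the parsing is done with `ast.literal_eval`, which only accepts Python literals:

```python
import ast

import numpy as np
import pandas as pd

def extract_names(value):
    """Return the list of genre names, or an empty list for missing values."""
    if not isinstance(value, str):
        # NaN and other non-string cells have nothing to parse
        return []
    records = ast.literal_eval(value)  # string -> list of dicts
    return [record["name"] for record in records]

# hypothetical sample rows, including one missing value
df = pd.DataFrame({
    "genre": [
        "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
        "[{'id': 35, 'name': 'Comedy'}]",
        np.nan,
    ]
})

df["genre_names"] = df["genre"].apply(extract_names)
print(df["genre_names"].tolist())
```

Keeping the result as a list makes it easy to follow up with `df.explode("genre_names")` if one row per genre is wanted.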