I'm trying to clean a dataset that I scraped from abjjad. It has five columns: book_title, author, Cover_url, genres, and descriptions.
For the genres column, the scraped data has the following format:
روايات وقصص روايات اجتماعية | روايات وقصص روايات واقعية |
Here is an image of exactly how it looks in VS Code.
I want to turn this into a list in which each genre is a separate element. Genres are separated by a newline and by '|'. First, I used this line to split on the '|':
df = pd.read_csv("/data/abjjad.csv",converters={'genres': lambda x: x[1:-1].split('|')})
This gave me the following:
['روايات وقصص\nروايات اجتماعية\n', '\nروايات وقصص\nروايات واقعية\n']
But the desired output is this:
['روايات وقصص', 'روايات اجتماعية', 'روايات وقصص', 'روايات واقعية']
I've looked into many questions similar to mine but haven't found a solution that works for me.
- removing newlines from messy strings in pandas dataframe cells?
- Remove '\n' in text in pandas python
- Removing /N character from a column in Python Dataframe
CodePudding user response:
First, you can try splitting on both "|" and "\n":
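import re
import pandas as pd
# Raw string so that '\|' is a regex escape rather than an invalid Python string escape
df = pd.read_csv("/data/abjjad.csv", converters={'genres': lambda x: re.split(r'\||\n', x[1:-1])})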
Then remove empty strings if present:
df.genres = df.genres.apply(lambda x: [s for s in x if s])
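The two steps above can also be folded into one helper, which may be easier to test without the CSV file. This is only a sketch: split_genres is a hypothetical name, and the sample string is an assumption reconstructed from the output shown in the question, not the real file contents. It also strips surrounding whitespace from each piece, which the original steps do not do.

```python
import re

def split_genres(cell):
    """Split a raw genres cell on '|' and newlines, dropping blank pieces."""
    parts = re.split(r'[|\n]', cell)
    # strip() removes stray spaces; the filter drops empty leftovers
    return [p.strip() for p in parts if p.strip()]

# Assumed shape of one scraped cell (reconstructed from the question's output)
sample = "\nروايات وقصص\nروايات اجتماعية\n|\nروايات وقصص\nروايات واقعية\n|"
genres = split_genres(sample)
print(genres)
```

If this fits your data, the function can be passed directly as converters={'genres': split_genres} in pd.read_csv, so the x[1:-1] slice and the separate apply step are no longer needed.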