Removing new lines from a data frame column

Time:08-13

I'm trying to clean a dataset that I scraped from abjjad. It has five columns: book_title, author, Cover_url, genres, and descriptions. In the genres column, the scraped data has the following format:

روايات وقصص روايات اجتماعية | روايات وقصص روايات واقعية |


I want to turn this into a list with each genre as a separate element. Genres are separated by a newline and by '|'. First, I used this line to split on the '|':

df = pd.read_csv("/data/abjjad.csv",converters={'genres': lambda x: x[1:-1].split('|')})

With that, I was able to get this:

['روايات وقصص\nروايات اجتماعية\n', '\nروايات وقصص\nروايات واقعية\n']

but the desired output is this:

['روايات وقصص', 'روايات اجتماعية', 'روايات وقصص', 'روايات واقعية']

I've looked into many questions similar to mine but haven't found a solution that works for me.

CodePudding user response:

First, you can try splitting on both "|" and "\n":

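import re
import pandas as pd
# Raw string so that \| is a literal pipe for the regex, not an invalid string escape
df = pd.read_csv("/data/abjjad.csv", converters={'genres': lambda x: re.split(r'\||\n', x[1:-1])})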

Then remove empty strings if present:

df.genres = df.genres.apply(lambda x: [s for s in x if s])
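
To check the two steps without the CSV, here is a minimal sketch that applies them to a single string. The `sample` value is an assumption reconstructed from the output shown in the question, and `split_genres` is a hypothetical helper name. Unlike the converter above, it does not slice with `x[1:-1]` and it also strips surrounding whitespace, so treat it as a variation rather than the exact same code.

```python
import re

def split_genres(raw):
    """Split a raw genres cell on '|' and newlines, dropping empty pieces."""
    # The character class [|\n] matches either separator.
    parts = re.split(r"[|\n]", raw)
    # Strip stray whitespace and discard the empty strings left between separators.
    return [p.strip() for p in parts if p.strip()]

# Assumed shape of one scraped cell: genres separated by newlines and '|'.
sample = "\nروايات وقصص\nروايات اجتماعية\n|\nروايات وقصص\nروايات واقعية\n|"

genres = split_genres(sample)
print(genres)
# ['روايات وقصص', 'روايات اجتماعية', 'روايات وقصص', 'روايات واقعية']
```

If this gives what you want, the same function could be passed directly as the converter, e.g. converters={'genres': split_genres}, which does the splitting and the filtering in one pass.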