How to edit a txt file with regular expressions (re) in Python

Hi guys,

I'm having trouble editing a txt file in Python.

Here are the first few lines of the txt file:

m0    $    10 things i hate about you    $    1999    $    6.90    $    62847    $    ['comedy', 'romance']
m1    $    1492: conquest of paradise    $    1992    $    6.20    $    10421    $    ['adventure', 'biography', 'drama', 'history']

Here is my code:

import re

file = open('datasets/movie_titles_metadata.txt')

def extract_categories(file):

    for line in file:
        line: str = line.rstrip()
        if re.search(" ", line):
            line = re.sub(r"[0-9]", "", line)
            line = re.sub(r"[$   : . ]", "", line)
            return line

extract_categories(file)

I need to get output that looks like this:

['action', 'comedy', 'crime', 'drama', 'thriller']

Can someone help?

CodePudding user response：

Regex is not the right tool for this. The list is always the last field on each line, so split once from the right with str.rsplit and parse that field:

from io import StringIO
import ast

content = """m0    $    10 things i hate about you    $    1999    $    6.90    $    62847    $    ['comedy', 'romance']
m1    $    1492: conquest of paradise    $    1992    $    6.20    $    10421    $    ['adventure', 'biography', 'drama', 'history']"""

# this is a mock file-handle, use your file instead here
with StringIO(content) as fh:
    genres = []

    for line in fh:
        # the 1 means that only 1 split occurs
        _, lst = line.rsplit('   $   ', 1)

        # use ast to convert the string representation
        # to a python list
        lst = ast.literal_eval(lst.strip())

        # extend your result list
        genres.extend(lst)

print(genres)
['comedy', 'romance', 'adventure', 'biography', 'drama', 'history']
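
If one list per movie is wanted (the desired output in the question looks like a single line's genres), or a sorted list without duplicates, the same idea can be wrapped in a small generator. This is only a sketch: the name genres_per_line is made up for illustration, and splitting on a bare "$" instead of the padded separator is an assumption chosen so that the amount of whitespace around it does not matter.

```python
import ast
from io import StringIO


def genres_per_line(lines, sep="$"):
    """Yield the parsed genre list for each line that ends with one."""
    for line in lines:
        # rpartition splits once at the last separator; the list is the tail.
        _, found, tail = line.rpartition(sep)
        if not found:
            continue  # no separator on this line, skip it
        try:
            genres = ast.literal_eval(tail.strip())
        except (ValueError, SyntaxError):
            continue  # last field is not a valid Python literal
        if isinstance(genres, list):
            yield genres


# Mock file handle built from the two sample lines in the question.
sample = StringIO(
    "m0    $    10 things i hate about you    $    1999    $    6.90    $    62847    $    ['comedy', 'romance']\n"
    "m1    $    1492: conquest of paradise    $    1992    $    6.20    $    10421    $    ['adventure', 'biography', 'drama', 'history']\n"
)

per_movie = list(genres_per_line(sample))
unique_sorted = sorted({g for genres in per_movie for g in genres})

print(per_movie)
print(unique_sorted)
```

With a real file, pass the open file handle in place of the StringIO object. Lines without a separator, or whose last field is not a list, are skipped rather than raising.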

CodePudding user response：

Alternatively, if you want to use regex instead:

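import re

def extract_categories(file):
    categories = []

    for line in file:
        # keep only the last field, which holds the list
        _, line = line.rsplit('   $   ', 1)
        # only process lines whose last field looks like a list of words
        if re.search(r"\['[a-z]+", line):
            res = re.findall(r"'([a-z]+)'", line)
            categories.extend(res)

    return categories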
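
One thing to watch with a letters-only pattern: it matches only names made of lowercase letters, so a hyphenated genre such as 'sci-fi' would be skipped. Whether the file contains such genres is an assumption, since the excerpt in the question does not show any. A slightly more permissive sketch captures anything between single quotes in the last field; the names QUOTED and categories_from_line are hypothetical.

```python
import re

# Capture whatever sits between a pair of single quotes.
QUOTED = re.compile(r"'([^']+)'")


def categories_from_line(line):
    # Look only at the text after the last "$" so quotes in a title cannot match.
    # If the line has no "$", rpartition returns the whole line as the tail.
    tail = line.rpartition("$")[2]
    return QUOTED.findall(tail)


first = categories_from_line(
    "m0    $    10 things i hate about you    $    1999    $    6.90    $    62847    $    ['comedy', 'romance']"
)
# Hypothetical line, used only to show the hyphenated case.
second = categories_from_line("m2    $    some title    $    ['sci-fi', 'film-noir']")

print(first)
print(second)
```

The trade-off is that this pattern accepts any quoted text, so it relies on the last field really being a list of genre names.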