I'm having trouble editing a txt file in Python.
Hi guys,
Here are the first few lines of the txt file:
m0 $ 10 things i hate about you $ 1999 $ 6.90 $ 62847 $ ['comedy', 'romance']
m1 $ 1492: conquest of paradise $ 1992 $ 6.20 $ 10421 $ ['adventure', 'biography', 'drama', 'history']
Here is my code:
import re

file = open('datasets/movie_titles_metadata.txt')

def extract_categories(file):
    for line in file:
        line: str = line.rstrip()
        if re.search(" ", line):
            line = re.sub(r"[0-9]", "", line)
            line = re.sub(r"[$ : . ]", "", line)
            return line

extract_categories(file)
I need to get an output that looks like this:
['action', 'comedy', 'crime', 'drama', 'thriller']
Can someone help?
CodePudding user response:
Regex is not the right tool for this. The list sits at the end of every line, so use str.rsplit:
from io import StringIO
import ast
content = """m0 $ 10 things i hate about you $ 1999 $ 6.90 $ 62847 $ ['comedy', 'romance']
m1 $ 1492: conquest of paradise $ 1992 $ 6.20 $ 10421 $ ['adventure', 'biography', 'drama', 'history']"""
# this is a mock file-handle, use your file instead here
with StringIO(content) as fh:
    genres = []
    for line in fh:
        # the 1 means that only 1 split occurs
        _, lst = line.rsplit(' $ ', 1)
        # use ast to convert the string representation
        # to a python list
        lst = ast.literal_eval(lst.strip())
        # extend your result list
        genres.extend(lst)
print(genres)
['comedy', 'romance', 'adventure', 'biography', 'drama', 'history']
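If the goal is a single alphabetical list with no repeats (my reading of the expected output in the question, so treat it as an assumption), the same rsplit idea can collect into a set and sort at the end. A minimal sketch; `collect_genres` and `sample` are names I made up for illustration:

```python
import ast


def collect_genres(lines):
    """Return the sorted, de-duplicated genres found in an iterable of lines."""
    genres = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split once from the right: the genre list is the last field
        _, raw_list = line.rsplit(' $ ', 1)
        # safely turn the text "['a', 'b']" into a real Python list
        genres.update(ast.literal_eval(raw_list))
    return sorted(genres)


# the two sample lines from the question
sample = [
    "m0 $ 10 things i hate about you $ 1999 $ 6.90 $ 62847 $ ['comedy', 'romance']",
    "m1 $ 1492: conquest of paradise $ 1992 $ 6.20 $ 10421 $ ['adventure', 'biography', 'drama', 'history']",
]
print(collect_genres(sample))
# ['adventure', 'biography', 'comedy', 'drama', 'history', 'romance']
```

With the real file, something like `with open('datasets/movie_titles_metadata.txt') as fh: print(collect_genres(fh))` should work, since a file handle is also an iterable of lines.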
CodePudding user response:
Alternatively, if you want to use regex instead:
import re

def extract_categories(file):
    categories = []
    for line in file:
        # keep only the last field, which holds the genre list
        _, line = line.rsplit(' $ ', 1)
        if re.search(r"\['[a-z]+", line):
            # capture every quoted lowercase word
            res = re.findall(r"'([a-z]+)'", line)
            categories.extend(res)
    return categories
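One thing to watch with the regex route: a pattern like `[a-z]+` only matches plain lowercase words. If the full file contains hyphenated genres (an assumption on my part, the two sample lines don't show any), those entries would be skipped silently. A small sketch of the difference; `extract_categories_regex` is a hypothetical name and the `'sci-fi'` string is made up for the demonstration:

```python
import re


def extract_categories_regex(lines):
    """Collect genres with a regex that also accepts hyphens."""
    categories = []
    for line in lines:
        # keep only the last field, which holds the genre list
        _, tail = line.rsplit(' $ ', 1)
        # [a-z-]+ matches lowercase letters and hyphens between the quotes
        categories.extend(re.findall(r"'([a-z-]+)'", tail))
    return categories


# the two sample lines from the question
sample = [
    "m0 $ 10 things i hate about you $ 1999 $ 6.90 $ 62847 $ ['comedy', 'romance']",
    "m1 $ 1492: conquest of paradise $ 1992 $ 6.20 $ 10421 $ ['adventure', 'biography', 'drama', 'history']",
]
print(extract_categories_regex(sample))
# ['comedy', 'romance', 'adventure', 'biography', 'drama', 'history']

# made-up list text, only to show what the narrower pattern misses
made_up = "['sci-fi', 'drama']"
print(re.findall(r"'([a-z]+)'", made_up))   # ['drama']
print(re.findall(r"'([a-z-]+)'", made_up))  # ['sci-fi', 'drama']
```

The ast.literal_eval approach from the other answer doesn't have this problem, because it parses the list instead of pattern-matching it.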