I am trying to read information from a column in my csv file and use it to create a new column. Please help
I imported the csv file and printed the first 10 rows ( header) but now I would like to create a column for the years in the title column.
```
import csv
from itertools import islice
from operator import itemgetter
#opening the CSV file
with open('/home/raymondossai/movies.csv', mode ='r')as file:
#reading the CSV file
csvFile = csv.reader(file)
#displaying the contents of the CSV file
for row in islice(csvFile, 11): # first 10 only
print(row)
```
Result:
['movieId', 'title', 'genres']
['1', 'Toy Story (1995)', 'Adventure|Animation|Children|Comedy|Fantasy']
['2', 'Jumanji (1995)', 'Adventure|Children|Fantasy']
['3', 'Grumpier Old Men (1995)', 'Comedy|Romance']
['4', 'Waiting to Exhale (1995)', 'Comedy|Drama|Romance']
['5', 'Father of the Bride Part II (1995)', 'Comedy']
['6', 'Heat (1995)', 'Action|Crime|Thriller']
['7', 'Sabrina (1995)', 'Comedy|Romance']
['8', 'Tom and Huck (1995)', 'Adventure|Children']
['9', 'Sudden Death (1995)', 'Action']
['10', 'GoldenEye (1995)', 'Action|Adventure|Thriller']
CodePudding user response:
You can use re
to extract the year from the title
:
rows = [
["movieId", "title", "genres"],
["1", "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"],
["2", "Jumanji (1995)", "Adventure|Children|Fantasy"],
["3", "Grumpier Old Men (1995)", "Comedy|Romance"],
["4", "Waiting to Exhale (1995)", "Comedy|Drama|Romance"],
["5", "Father of the Bride Part II (1995)", "Comedy"],
["6", "Heat (1995)", "Action|Crime|Thriller"],
["7", "Sabrina (1995)", "Comedy|Romance"],
["8", "Tom and Huck (1995)", "Adventure|Children"],
["9", "Sudden Death (1995)", "Action"],
["10", "GoldenEye (1995)", "Action|Adventure|Thriller"],
]
import re
pat = re.compile(r"\((\d{4})\)")
for movie_id, title, genres in rows[1:]:
year = pat.search(title)
print([movie_id, title, genres, year.group(1) if year else "N/A"])
Prints:
['1', 'Toy Story (1995)', 'Adventure|Animation|Children|Comedy|Fantasy', '1995']
['2', 'Jumanji (1995)', 'Adventure|Children|Fantasy', '1995']
['3', 'Grumpier Old Men (1995)', 'Comedy|Romance', '1995']
['4', 'Waiting to Exhale (1995)', 'Comedy|Drama|Romance', '1995']
['5', 'Father of the Bride Part II (1995)', 'Comedy', '1995']
['6', 'Heat (1995)', 'Action|Crime|Thriller', '1995']
['7', 'Sabrina (1995)', 'Comedy|Romance', '1995']
['8', 'Tom and Huck (1995)', 'Adventure|Children', '1995']
['9', 'Sudden Death (1995)', 'Action', '1995']
['10', 'GoldenEye (1995)', 'Action|Adventure|Thriller', '1995']
CodePudding user response:
you should definetly use pandas for this, it's way easier to work with tables and way cleaner too.
try to read in the csv file like so:
import pandas as pd
df = pd.read_csv('/home/raymondossai/movies.csv')
the df object is basically your csv table represented as an object in python.
for having the year as an extra column you could use the str.split() method since the year always follows after a ' (' expression:
# get the 4 characters of the year (first 4 characters after the ' (' expression)
df['Year'] = df['title'].str.split(pat=' (', expand=True)[1][:4].astype(int)