I have a dictionary that looks like this:
{'movies': ["1.\nBram Stoker's Dracula\n(1992)",
'2.\nDracula\n(1931)',
'3.\nHotel Transylvania\n(2012)',
'4.\nBlade: Trinity\n(2004)',
'5.\nDracula Untold\n(2014)',
'6.\nThe Monster Squad\n(1987)',
'7.\nNosferatu\n(1922)',
'8.\nHotel Transylvania 3\n(2018)',
'9.\nHotel Transylvania 2\n(2015)']}
I want to split this dictionary into two separate keys, for example:
#expected output
{'movies': ["Bram Stoker's Dracula",
"Dracula" ...], 'year':[1992, 1931 ...]}
I tried this, which was supposed to pick out the strings that come after each \n:
result = {}
for k,v in movies.items():
    result[k] = movies[k].lower().replace(' ', '_').split('\n')
but I get the error:
'list' object has no attribute 'lower'
CodePudding user response:
The error occurs because movies[k] is a list of strings, not a single string, so it has no .lower() method.
Suppose this input:
d1 = {'movies': ["1.\nBram Stoker's Dracula\n(1992)",
'2.\nDracula\n(1931)',
'3.\nHotel Transylvania\n(2012)',
'4.\nBlade: Trinity\n(2004)',
'5.\nDracula Untold\n(2014)',
'6.\nThe Monster Squad\n(1987)',
'7.\nNosferatu\n(1922)',
'8.\nHotel Transylvania 3\n(2018)',
'9.\nHotel Transylvania 2\n(2015)']}
Create a dict to store your extracted data:
d2 = {'movies': [], 'year': []}
for row in d1['movies']:
    _, movie, year = row.split('\n')
    d2['movies'].append(movie)
    d2['year'].append(int(year[1:-1]))
No validation is done here; this assumes every entry has the same three-line format.
Result output:
>>> d2
{'movies': ["Bram Stoker's Dracula",
'Dracula',
'Hotel Transylvania',
'Blade: Trinity',
'Dracula Untold',
'The Monster Squad',
'Nosferatu',
'Hotel Transylvania 3',
'Hotel Transylvania 2'],
'year': [1992, 1931, 2012, 2004, 2014, 1987, 1922, 2018, 2015]}
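The loop above raises a ValueError on unpacking as soon as an entry does not split into exactly three parts. As a minimal sketch, assuming malformed entries should simply be reported as None rather than stop the loop, the check could look like this (split_entry and the shortened rows sample are hypothetical, not part of the original answer):

```python
def split_entry(row):
    # Return (title, year) for one "N.<newline>Title<newline>(YYYY)" string,
    # or None if the entry does not have that shape.
    parts = row.split('\n')
    if len(parts) != 3:
        return None  # wrong number of lines
    _, title, year = parts
    # Expect the year wrapped in parentheses, e.g. '(1992)'
    if not (year.startswith('(') and year.endswith(')') and year[1:-1].isdecimal()):
        return None
    return title, int(year[1:-1])

# Hypothetical sample: the second entry has no year line
rows = ["1.\nBram Stoker's Dracula\n(1992)", "2.\nDracula"]
parsed = [split_entry(r) for r in rows]
print(parsed)
```

Whether to skip, log, or keep a placeholder for the None results depends on what the data is used for afterwards.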
Update: a more robust version with regex:
import re
d2 = {'movies': [], 'year': []}
for row in d1['movies']:
    # the "(year)" part is optional, so group(2) is None when it is missing
    sre = re.search(r'\d\.\n(.*)(?:\n?\((\d+)\))?', row)
    movie = sre.group(1)
    year = int(sre.group(2)) if sre.group(2) else float('nan')
    d2['movies'].append(movie)
    d2['year'].append(year)
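To see what the regex version does with an entry that lacks a year, here is a small sketch on a hypothetical two-item input (the shortened d1 and the missing list are assumptions for illustration only):

```python
import math
import re

# Hypothetical sample: the second entry has no "(year)" line
d1 = {'movies': ["1.\nBram Stoker's Dracula\n(1992)", "2.\nDracula"]}

d2 = {'movies': [], 'year': []}
for row in d1['movies']:
    sre = re.search(r'\d\.\n(.*)(?:\n?\((\d+)\))?', row)
    d2['movies'].append(sre.group(1))
    d2['year'].append(int(sre.group(2)) if sre.group(2) else float('nan'))

print(d2)

# NaN never compares equal to itself, so test for it with math.isnan
missing = [m for m, y in zip(d2['movies'], d2['year']) if math.isnan(y)]
print(missing)
```

Note that the year list then mixes int and float values; if that is a problem, None may be a more convenient placeholder than NaN.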
CodePudding user response:
I would probably use regular expressions to solve this kind of pattern-matching task. In the example below, the title and year are captured in groups.
import re
movie_pattern = re.compile(r"[0-9]+\.\n([^\n]+)\n\(([0-9]+)\)")
movies_dict = {
    ...
}
split_dict = {"titles": [], "years": []}
for movie in movies_dict["movies"]:
    match = movie_pattern.fullmatch(movie)
    split_dict["titles"].append(match[1])
    split_dict["years"].append(int(match[2]))
print(split_dict)
Explanation:
[0-9]+\. : One or more digits representing the index, followed by a literal .
\n : A literal newline
([^\n]+) : One or more non-newline characters representing the title, in a capture group
\n : A literal newline
\( : A literal (
([0-9]+) : One or more digits representing the year, in a capture group
\) : A literal )
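The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch: the parse helper is a hypothetical name, and it additionally guards against fullmatch returning None for entries that do not fit the pattern.

```python
import re

# Same pattern as above, spread over several lines so each piece has a comment
movie_pattern = re.compile(r"""
    [0-9]+\.      # one or more digits (the index), then a literal dot
    \n            # a newline
    ([^\n]+)      # group 1: the title, one or more non-newline characters
    \n            # a newline
    \(([0-9]+)\)  # group 2: the year digits, wrapped in literal parentheses
""", re.VERBOSE)

def parse(entry):
    # fullmatch returns None when the entry does not fit the pattern
    match = movie_pattern.fullmatch(entry)
    if match is None:
        return None
    return match[1], int(match[2])

print(parse("7.\nNosferatu\n(1922)"))
print(parse("7. Nosferatu 1922"))
```

The first call returns the title and year as a tuple; the second returns None because the entry has no newlines or parentheses.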
Edit: to handle the case you mention in your comment, where the year is sometimes missing, you can use the following pattern instead:
movie_pattern = re.compile(r"[0-9]+\.\n([^\n]+)(?:\n\(([0-9]+)\))?")
The (?:...) is a non-capturing group, and the ? after it makes the group optional.
Then, if the year is missing, the value of the capture group is None, which you can handle like so:
for movie in movies_dict["movies"]:
    match = movie_pattern.fullmatch(movie)
    title = match[1]
    year = match[2]
    if year is not None:
        year = int(year)
    split_dict["titles"].append(title)
    split_dict["years"].append(year)
So if movies_dict looks like
movies_dict = {
    "movies": [
        ...,
        "9.\nHotel Transylvania 2",  # only one newline, no year
    ],
}
then the output will be
{
    "titles": [..., "Hotel Transylvania 2"],
    "years": [..., None],
}
CodePudding user response:
For fun, here is a solution as a list comprehension (assuming movies_dict as input):
dict(zip(['movies', 'year'], zip(*[(a, int(b[1:-1]))
                                   for i in movies_dict['movies']
                                   for a,b in [i.split('\n')[1:]]
                                   ])))
output:
{'movies': ("Bram Stoker's Dracula",
'Dracula',
'Hotel Transylvania',
'Blade: Trinity',
'Dracula Untold',
'The Monster Squad',
'Nosferatu',
'Hotel Transylvania 3',
'Hotel Transylvania 2'),
'year': (1992, 1931, 2012, 2004, 2014, 1987, 1922, 2018, 2015)}
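Note that the values in this output are tuples, not lists as in the question's expected output. If lists are needed, one possible variation (a sketch on a shortened, hypothetical movies_dict; pairs and result are illustrative names) is to convert each tuple while building the dict:

```python
# Shortened sample input, for illustration only
movies_dict = {'movies': ["1.\nBram Stoker's Dracula\n(1992)", '2.\nDracula\n(1931)']}

# One (title, year) pair per entry; [1:] drops the leading index part
pairs = [(title, int(year[1:-1]))
         for entry in movies_dict['movies']
         for title, year in [entry.split('\n')[1:]]]

# zip(*pairs) transposes the pairs into one tuple of titles and one of years;
# list(...) turns each of those tuples into a list
result = {key: list(values) for key, values in zip(['movies', 'year'], zip(*pairs))}
print(result)
```

With an empty input list, zip(*pairs) yields nothing, so result would be an empty dict rather than a dict with two empty lists.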