Split new-lines into separate keys

I have a dictionary that looks like this:

{'movies': ["1.\nBram Stoker's Dracula\n(1992)",
  '2.\nDracula\n(1931)',
  '3.\nHotel Transylvania\n(2012)',
  '4.\nBlade: Trinity\n(2004)',
  '5.\nDracula Untold\n(2014)',
  '6.\nThe Monster Squad\n(1987)',
  '7.\nNosferatu\n(1922)',
  '8.\nHotel Transylvania 3\n(2018)',
  '9.\nHotel Transylvania 2\n(2015)']}

I want to split this dictionary into two separate keys, for example:

#expected output
{'movies': ["Bram Stoker's Dracula",
  "Dracula" ...], 'year':[1992, 1931 ...]}

I tried the following, which was supposed to select the strings that come after each \n:

result = {}
for k,v in movies.items():
    result[k] = movies[k].lower().replace(' ', '_').split('\n')

but I get the error:

'list' object has no attribute 'lower'

CodePudding user response：

movies[k] is a list, not a string, so it has no .lower() method. Split each element of the list instead.
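To see where the error comes from, here is a minimal reproduction. The variable name movies follows the question; the two-entry data is a shortened copy used only for illustration.

```python
movies = {'movies': ["1.\nBram Stoker's Dracula\n(1992)", '2.\nDracula\n(1931)']}

# movies['movies'] is a list, and lists have no .lower() method.
try:
    movies['movies'].lower()
except AttributeError as exc:
    message = str(exc)

# String methods have to be applied to each element of the list instead.
parts = [row.split('\n') for row in movies['movies']]
```

Each entry in parts is then a three-item list: the index, the title, and the year in parentheses.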

Suppose this input:

d1 = {'movies': ["1.\nBram Stoker's Dracula\n(1992)",
  '2.\nDracula\n(1931)',
  '3.\nHotel Transylvania\n(2012)',
  '4.\nBlade: Trinity\n(2004)',
  '5.\nDracula Untold\n(2014)',
  '6.\nThe Monster Squad\n(1987)',
  '7.\nNosferatu\n(1922)',
  '8.\nHotel Transylvania 3\n(2018)',
  '9.\nHotel Transylvania 2\n(2015)']}

Create a dict to store your extracted data:

d2 = {'movies': [], 'year': []}
for row in d1['movies']:
    _, movie, year = row.split('\n')
    d2['movies'].append(movie)
    d2['year'].append(int(year[1:-1]))

No validation is done here. This assumes every entry has the same three-part format.
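If some rows might not follow that format, a check before unpacking avoids a ValueError. This is only a sketch: split_rows is a hypothetical helper name, and skipping malformed rows (rather than raising) is an assumption about what you want.

```python
# Hypothetical helper, not part of the answer above.
def split_rows(rows):
    """Split 'N.\\nTitle\\n(YYYY)' strings, setting aside rows that do not fit."""
    result = {'movies': [], 'year': []}
    skipped = []
    for row in rows:
        parts = row.split('\n')
        # Expect exactly three parts, the last one shaped like "(1992)".
        well_formed = (
            len(parts) == 3
            and parts[2].startswith('(')
            and parts[2].endswith(')')
            and parts[2][1:-1].isdecimal()
        )
        if not well_formed:
            skipped.append(row)
            continue
        result['movies'].append(parts[1])
        result['year'].append(int(parts[2][1:-1]))
    return result, skipped


d2, skipped = split_rows(["1.\nBram Stoker's Dracula\n(1992)",
                          '2.\nDracula',           # no year line
                          '3.\nNosferatu\n(1922)'])
```

Returning the skipped rows alongside the result makes it easy to inspect what was left out.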

Result output:

>>> d2
{'movies': ["Bram Stoker's Dracula",
  'Dracula',
  'Hotel Transylvania',
  'Blade: Trinity',
  'Dracula Untold',
  'The Monster Squad',
  'Nosferatu',
  'Hotel Transylvania 3',
  'Hotel Transylvania 2'],
 'year': [1992, 1931, 2012, 2004, 2014, 1987, 1922, 2018, 2015]}

Update: a more robust version with regex:

import re

d2 = {'movies': [], 'year': []}
for row in d1['movies']:
    sre = re.search(r'\d\.\n(.*)(?:\n?\((\d+)\))?', row)
    movie = sre.group(1)
    year = int(sre.group(2)) if sre.group(2) else float('nan')
    d2['movies'].append(movie)
    d2['year'].append(year)
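As a quick check of the missing-year case, the same loop can be run on a shortened input. The two-row d1 below is illustrative only; its second row has no year line.

```python
import re

# Shortened, illustrative input: the second row has no year line.
d1 = {'movies': ["1.\nBram Stoker's Dracula\n(1992)", '9.\nHotel Transylvania 2']}

d2 = {'movies': [], 'year': []}
for row in d1['movies']:
    sre = re.search(r'\d\.\n(.*)(?:\n?\((\d+)\))?', row)
    d2['movies'].append(sre.group(1))
    # group(2) is None when the optional "(year)" part did not match.
    d2['year'].append(int(sre.group(2)) if sre.group(2) else float('nan'))
```

Note that NaN is a float, so the year list then mixes int and float values. Storing None instead is an alternative if that matters for later processing.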

CodePudding user response：

I would probably use regular expressions to solve this kind of pattern-matching task. In the example below, the title and year are captured in groups.

import re

movie_pattern = re.compile(r"[0-9]+\.\n([^\n]+)\n\(([0-9]+)\)")

movies_dict = {
    ...
}

split_dict = {"titles": [], "years": []}

for movie in movies_dict["movies"]:
    match = movie_pattern.fullmatch(movie)

    split_dict["titles"].append(match[1])
    split_dict["years"].append(int(match[2]))

print(split_dict)

Explanation:

[0-9]+\.: One or more digits representing the index, followed by a literal .

\n: Literal \n

([^\n]+): One or more non-newline characters representing the title, in a capture group

\n: Literal \n

\(: Literal (

([0-9]+): One or more digits representing the year, in a capture group

\): Literal )
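To make the breakdown above concrete, here is a small sketch that applies the pattern to a single entry. The sample strings are taken from the question's data; the variable names title, year and no_match are only for illustration.

```python
import re

movie_pattern = re.compile(r"[0-9]+\.\n([^\n]+)\n\(([0-9]+)\)")

match = movie_pattern.fullmatch("4.\nBlade: Trinity\n(2004)")
title, year = match[1], match[2]   # both groups are strings at this point

# fullmatch returns None when the whole string does not fit the pattern,
# for example when the year line is missing.
no_match = movie_pattern.fullmatch("4.\nBlade: Trinity")
```

Because fullmatch can return None, indexing the result directly raises a TypeError for entries that do not fit, so it may be worth checking for None first if the data is not uniform.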

Edit: to handle the case you mention in your comment, where the year is sometimes missing, you can use the following pattern instead:

movie_pattern = re.compile(r"[0-9]+\.\n([^\n]+)(?:\n\(([0-9]+)\))?")

(?:...) is a non-capturing group, and the ? after it makes the whole group optional.

Then, if the year is missing, the value of the capture group is None, which you can handle like so:

for movie in movies_dict["movies"]:
    match = movie_pattern.fullmatch(movie)

    title = match[1]
    year = match[2]

    if year is not None:
        year = int(year)

    split_dict["titles"].append(title)
    split_dict["years"].append(year)

So if movies_dict looks like

movies_dict = {
    "movies": [
        ..., 
        "9.\nHotel Transylvania 2",  # only one newline, no year
    ],
}

then the output will be

{
    "titles": [..., "Hotel Transylvania 2"],
    "years": [..., None],
}

CodePudding user response：

For fun, here is a solution as a list comprehension (assuming movies_dict as input):

dict(zip(['movies', 'year'], zip(*[(a, int(b[1:-1]))
                                    for i in movies_dict['movies']
                                    for a,b in [i.split('\n')[1:]]
                                   ])))

output:

{'movies': ("Bram Stoker's Dracula",
  'Dracula',
  'Hotel Transylvania',
  'Blade: Trinity',
  'Dracula Untold',
  'The Monster Squad',
  'Nosferatu',
  'Hotel Transylvania 3',
  'Hotel Transylvania 2'),
 'year': (1992, 1931, 2012, 2004, 2014, 1987, 1922, 2018, 2015)}
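Note that zip(*...) produces tuples, as the output shows. If lists are needed, as in the expected output from the question, one possible variation is to map list over the zipped columns. The two-entry movies_dict here is a shortened stand-in for the real data, and pairs and result are illustrative names.

```python
movies_dict = {'movies': ["1.\nBram Stoker's Dracula\n(1992)", '2.\nDracula\n(1931)']}

pairs = [(a, int(b[1:-1]))
         for i in movies_dict['movies']
         for a, b in [i.split('\n')[1:]]]

# zip(*pairs) yields one tuple per column; map(list, ...) turns each into a list.
result = dict(zip(['movies', 'year'], map(list, zip(*pairs))))
```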