Multiline string to dictionary

Time:10-05

Let's say I have the following multiline string. We can assume that the title is always followed by a line of dashes.

"""
This is title
-------------------------
Author: Name of the author

Sentence 1.

Sentence 2.
"""

And I want to convert that to a dict like this:

{
    "title": "This is title",
    "author": "Name of the author",
    "body": "Sentence 1.\n\nSentence 2.",
}

How can I pick the title and author out of the first lines (skipping the "-----" line) and then keep the rest, split by newline, as the body? Could you please give me some suggestions?

CodePudding user response:

Assuming this is the standard layout of every string you are given, you can use multiple assignment and str.split to unpack the lines into variables and then construct your dict. You just need str.join to put the body lines back together after they have been split apart:

s = """
This is title
-------------------------
Author: Name of the author

Sentence 1.

Sentence 2.
"""

_, title, _, author, *body = s.split('\n')

data = {
    "title": title,
    "author": ' '.join(author.split()[1:]),
    "body": '\n'.join(body)
}

Pretty-printed output:

{'title': 'This is title',
 'author': 'Name of the author',
 'body': '\nSentence 1.\n\nSentence 2.\n'}

Although this works, it is fairly ad hoc for a real-world application: it depends on every line being in exactly the expected position. If you want a more robust solution, see whether you can change how your data is supplied or stored.
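If the layout does hold, the same idea can be wrapped in a small function that strips the surrounding blank lines first, so the body comes out as in the dict the question asks for. This is only a sketch: the name parse_entry is made up for illustration, and it assumes the title, the dashes and the "Author:" line are the first three non-blank lines.

```python
def parse_entry(text):
    """Split one entry into title, author and body (hypothetical helper)."""
    # Drop leading/trailing blank lines so the first line is the title.
    title, _dashes, author_line, *body = text.strip().split('\n')
    # Keep everything after the first colon as the author's name.
    _label, _sep, author = author_line.partition(':')
    return {
        "title": title.strip(),
        "author": author.strip(),
        # Strip the blank line that separates the header from the body.
        "body": '\n'.join(body).strip(),
    }


s = """
This is title
-------------------------
Author: Name of the author

Sentence 1.

Sentence 2.
"""

print(parse_entry(s))
```

For the sample string this gives a body of 'Sentence 1.\n\nSentence 2.' with no stray newlines. A string with fewer than three lines raises ValueError at the unpacking step, which is probably what you want for malformed input.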

CodePudding user response:

This is how I solved this:

What part could be considered static?

IMO, the title, the dashed line under it, and the author can be considered static (so to speak), because the dashed line is a reliable delimiter for the other two. A regex lets us describe the string format we expect, so I wrote a pattern that describes the parts we are treating as static.

How do I get the rest?

A regex match also gives us the start and end positions of the text it matched. By storing those positions we can build a list of offsets that pinpoint the gaps between our matches. Those gaps are "the rest".
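As a small illustration of the offset idea on its own (the "head:" format and the variable names here are invented for the example, not part of the real data):

```python
import re

text = "head: A\nbody of A\nhead: B\nbody of B\n"
# Hypothetical one-line "static" part, just for illustration.
header = re.compile(r'^head: (\w+)\n', re.M)

matches = list(header.finditer(text))
# Each gap runs from the end of one match to the start of the next,
# or to the end of the text for the last match.
ends = [m.end() for m in matches]
starts = [m.start() for m in matches[1:]] + [len(text)]
gaps = [text[s:e] for s, e in zip(ends, starts)]

print(gaps)  # ['body of A\n', 'body of B\n']
```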

Why not just add the body into the format?

I couldn't figure out a simple way to do it. There are too many cases to handle with single and double newlines, and the body can also run into the next entry.

import re

#describe the format of an entry
fmt  = re.compile(r'^(?P<title>([\w\d ]+))\n([-]+)\nAuthor: (?P<author>([\w\d ]+))\n\n', re.I|re.M)

# SINGLE ENTRY
dat  = ('Title 1\n'
        '-------------------------\n'
        'Author: Some Guy\n\n'
        'Sentence 1.\n\n'
        'Sentence 2.\n\n')

#get book  
m    = fmt.search(dat)
book = dict(title=m.group('title'), author=m.group("author"), body=dat[m.end():]) if m else None
 
#print book
print(book)

# MULTIPLE ENTRIES
dat  = ('Title 1\n'
        '-------------------------\n'
        'Author: Some Guy\n\n'
        'Sentence 1.\n\n'
        'Sentence 2.\n\n'
        'Title 2\n'
        '-------------------------\n'
        'Author: Some Other Guy\n\n'
        'Sentence 1.\nSentence 2\n\n'
        'Sentence 3.\n\n')

#prime books
books = list()
#to be used for storing body offsets
ofs   = list()
#for storing positions that need to be carried over
p     = -1

#get all body offsets
for m in fmt.finditer(dat):
    if p > -1: ofs.append((p, m.start()))
    p = m.end()
    books.append(dict(title=m.group('title'), author=m.group("author"), body=" "))

#store final offsets    
ofs.append((p, len(dat)))

#store each body based on offsets
for i, (s, e) in enumerate(ofs):
    books[i]['body']=dat[s:e]
    
#print all books
print(books)
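
For comparison, here is one possible way to fold the body into the pattern after all, using a lazy match that stops at a lookahead for the next title-plus-dashes header or the end of the string. Treat it as a sketch rather than a drop-in replacement: the name entry is made up, and it assumes a body never contains a line of plain words directly followed by a line of dashes, since that would be taken for the next title.

```python
import re

# Header as before, then a lazy body that ends where the next
# "title line + dash line" begins, or at the end of the string.
entry = re.compile(
    r'^(?P<title>[\w ]+)\n-+\nAuthor: (?P<author>[\w ]+)\n\n'
    r'(?P<body>.*?)(?=^[\w ]+\n-+\n|\Z)',
    re.M | re.S,
)

dat = ('Title 1\n'
       '-------------------------\n'
       'Author: Some Guy\n\n'
       'Sentence 1.\n\n'
       'Sentence 2.\n\n'
       'Title 2\n'
       '-------------------------\n'
       'Author: Some Other Guy\n\n'
       'Sentence 1.\nSentence 2\n\n'
       'Sentence 3.\n\n')

# One dict per entry; strip() drops the trailing blank lines of each body.
books = [
    {key: value.strip() for key, value in m.groupdict().items()}
    for m in entry.finditer(dat)
]

print(books)
```

This trades the offset bookkeeping for a slightly denser pattern, so which version is easier to maintain is a matter of taste.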