Let's say I have the following multiline string. We can assume that the title is always followed by a line of dashes.
"""
This is title
-------------------------
Author: Name of the author

Sentence 1.

Sentence 2.
"""
And I want to convert that to a dict like this:
{
"title": "This is title",
"author": "Name of the author",
"body": "Sentence 1.\n\nSentence 2.",
}
How can I split off the title and the "-----" line that follows it, and then split the rest of the text by newline? Could you please give me some suggestions?
CodePudding user response:
Assuming this is the standard layout of all the strings you are given, you can use multiple assignment and str.split to unpack the lines into variables, then construct your dict. You just need str.join to re-join the body lines after they have been split apart:
s = """
This is title
-------------------------
Author: Name of the author

Sentence 1.

Sentence 2.
"""
_, title, _, author, *body = s.split('\n')
data = {
"title": title,
"author": ' '.join(author.split()[1:]),
"body": '\n'.join(body)
}
PrettyPrinted output:
{'title': 'This is title',
'author': 'Name of the author',
'body': '\nSentence 1.\n\nSentence 2.\n'}
Although this works, it is a bit messy and ad hoc for a real-world application. If you want a more robust solution, see whether you can change how your data is supplied or stored.
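As a rough sketch of the same idea wrapped in a function: the version below strips the surrounding blank lines so that the body matches the dict asked for in the question. The name parse_entry is made up for illustration, and it assumes every string has at least a title line, a dashed line and an "Author: " line, in that order (anything shorter raises ValueError on unpacking).

```python
def parse_entry(text):
    # Drop leading/trailing blank lines, then peel off the three header lines.
    title, _rule, author_line, *body_lines = text.strip().split('\n')
    # Everything after the first ': ' is taken as the author name.
    _, _, author = author_line.partition(': ')
    return {
        "title": title.strip(),
        "author": author.strip(),
        # Re-join the remaining lines and trim the blank lines around them.
        "body": '\n'.join(body_lines).strip(),
    }


sample = (
    "\nThis is title\n"
    "-------------------------\n"
    "Author: Name of the author\n"
    "\n"
    "Sentence 1.\n"
    "\n"
    "Sentence 2.\n"
)
print(parse_entry(sample))
```

Using str.partition instead of author.split()[1:] keeps any internal spacing in the author name as it was written.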
CodePudding user response:
This is how I solved this:
What part could be considered static?
IMO, the title, the dashed line under it, and the author can be considered static (so to speak), because the dashed line is a reliable delimiter for the other two. A regex lets us describe the string format we expect, so I wrote a pattern that describes the parts we are treating as static.
How do I get the rest?
RegEx also allows us to store the start and end positions of our matches. By storing this information properly, we can create a list of offsets that pinpoint all of the gaps between our matches. Those gaps are "the rest".
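To make the offset idea concrete, here is a toy sketch on a throwaway string (the string and the names sep, gaps and pieces are only for illustration). Each match object reports where it starts and ends, and the text between one match's end and the next match's start is a gap.

```python
import re

text = "aa--bbb--c"
sep = re.compile(r"-+")

gaps = []
prev_end = 0
for m in sep.finditer(text):
    # The gap runs from the end of the previous match to the start of this one.
    gaps.append((prev_end, m.start()))
    prev_end = m.end()
# The last gap runs from the final match to the end of the string.
gaps.append((prev_end, len(text)))

pieces = [text[s:e] for s, e in gaps]
print(pieces)  # ['aa', 'bbb', 'c']
```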
Why not just add the body into the format?
I couldn't figure it out in a simple way. There are too many conditions for double and single new lines that can also overlap the next entry.
import re
#describe the format of an entry
fmt = re.compile(r'^(?P<title>([\w\d ]+))\n([-]+)\nAuthor: (?P<author>([\w\d ]+))\n\n', re.I|re.M)
# SINGLE ENTRY
dat = ('Title 1\n'
'-------------------------\n'
'Author: Some Guy\n\n'
'Sentence 1.\n\n'
'Sentence 2.\n\n')
#get book
m = fmt.search(dat)
book = dict(title=m.group('title'), author=m.group("author"), body=dat[m.end():len(dat)]) if m else None
#print book
print(book)
# MULTIPLE ENTRIES
dat = ('Title 1\n'
'-------------------------\n'
'Author: Some Guy\n\n'
'Sentence 1.\n\n'
'Sentence 2.\n\n'
'Title 2\n'
'-------------------------\n'
'Author: Some Other Guy\n\n'
'Sentence 1.\nSentence 2\n\n'
'Sentence 3.\n\n')
#prime books
books = list()
#to be used for storing body offsets
ofs = list()
#for storing positions that need to be carried over
p = -1
#get all body offsets
for m in fmt.finditer(dat):
    if p > -1:
        ofs.append((p, m.start()))
    p = m.end()
    books.append(dict(title=m.group('title'), author=m.group("author"), body=" "))
#store final offsets
ofs.append((p, len(dat)))
#store each body based on offsets
for i, (s, e) in enumerate(ofs):
    books[i]['body'] = dat[s:e]
#print all books
print(books)
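On the question of adding the body into the format: one possible way, offered only as a sketch, is a lazy body group that stops at a lookahead for the next title plus dashed line, or at the end of the string. The names ENTRY and parse_entries are made up here, and the pattern assumes that no line inside a body is immediately followed by a line made up only of dashes, since that would be mistaken for the next header.

```python
import re

ENTRY = re.compile(
    r'^(?P<title>[^\n]+)\n'          # title line
    r'-+\n'                          # dashed line
    r'Author: (?P<author>[^\n]+)\n'  # author line
    r'(?P<body>.*?)'                 # lazily take everything up to...
    r'(?=^[^\n]+\n-+\n|\Z)',         # ...the next title + dashed line, or the end
    re.M | re.S,
)


def parse_entries(text):
    # One dict per entry; blank lines around each body are trimmed.
    return [
        {
            "title": m.group("title"),
            "author": m.group("author"),
            "body": m.group("body").strip(),
        }
        for m in ENTRY.finditer(text)
    ]


data = ('Title 1\n'
        '-------------------------\n'
        'Author: Some Guy\n\n'
        'Sentence 1.\n\n'
        'Sentence 2.\n\n'
        'Title 2\n'
        '-------------------------\n'
        'Author: Some Other Guy\n\n'
        'Sentence 1.\nSentence 2\n\n'
        'Sentence 3.\n\n')

books = parse_entries(data)
print(books)
```

Unlike the offset version, this trims the blank lines around each body, so drop the .strip() call if you need them preserved.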