I have two lists.
The first is a list of authors. The second contains two types of objects, authors and text. Text is split into words. In the second list, each author comes first and is followed by several words which together make up that author's speech. The second list contains several authors with their speeches.
authors = ['M. Maxime Gremet', 'M. le président.', 'M.Claude Goasgu', 'M.Jean-Marc Ayr',
'M.Maxime Gremet', 'M.Roland Chassa', 'M.le président.']
authors_and_words = ['M. le président.', "Conformément au premier alinéa de l'article 28 de la Constitution, je déclare ouverte la session ordinaire de 2003-2004.", "Mes chers collègues, permettez-moi d'abord de vous dire combien je suis heureux de vous retrouver tous.", 'M. Maxime Gremetz.', 'Nous aussi !']
I would like to extract an author and the words of his speech from the second list into a new list (or even better a dictionary).
The output dictionary would have the following structure:
{'author': ['word1', 'word2', 'word3']}
For the actual lists above, the solution would be the following list.
solution = [{'M. le président.': ["Conformément au premier alinéa de l'article 28 de la Constitution, je déclare ouverte la session ordinaire de 2003-2004.", "Mes chers collègues, permettez-moi d'abord de vous dire combien je suis heureux de vous retrouver tous."]}, {'M. Maxime Gremetz.':['Nous aussi !']}]
I tried to do this with different types of loops, but I struggle to keep track of the state of the second list. I guess there is an algorithmic solution for this, but unfortunately I don't have much experience with algorithms.
CodePudding user response:
Checking if an element of authors_and_words is in authors is easy enough: use the in operator (line in authors).
Now iterate over each element of authors_and_words. If it is an author, it becomes a dictionary key: whenever you find a new author, create a dictionary that will store their lines and append it to the main list containing all authors' lines:
conversation = ["junk value"]  # Initialize with a junk value to prevent an error in `conversation[-1]` in the first loop
for line in authors_and_words:
    # <is an author> and <prev author is different>
    if line in authors and line not in conversation[-1]:
        d = {line: []}  # Create an empty dict for this author
        conversation.append(d)
    else:
        author = list(conversation[-1].keys())[0]  # Get author from the last dict in the list
        conversation[-1][author].append(line)
result = conversation[1:]  # Discard junk element from result
This gives:
[{'M. le président.': ["Conformément au premier alinéa de l'article 28 de la Constitution, je déclare ouverte la session ordinaire de 2003-2004.",
"Mes chers collègues, permettez-moi d'abord de vous dire combien je suis heureux de vous retrouver tous."]},
{'M. Maxime Gremetz.': ['Nous aussi !']}]
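For anyone who wants to try this quickly, here is a minimal sketch of the same idea wrapped in a function. The name group_by_speaker is my own, and the sample lists are shortened placeholders. It keeps a reference to the current speaker's list instead of using a junk first element, and, unlike the loop above, it treats every author line as the start of a new entry and silently skips text that appears before the first author.

```python
def group_by_speaker(authors, authors_and_words):
    """Return a list of {author: [lines]} dicts, in speaking order."""
    conversation = []
    current_lines = None  # list of lines for the most recent speaker
    for line in authors_and_words:
        if line in authors:
            # A new speaker starts: open a fresh entry for them
            current_lines = []
            conversation.append({line: current_lines})
        elif current_lines is not None:
            # Ordinary text: attach it to the most recent speaker
            current_lines.append(line)
    return conversation


# Shortened sample data (assumes the author names match exactly)
sample_authors = ['M. le président.', 'M. Maxime Gremetz.']
sample_items = ['M. le président.', 'first line', 'second line',
                'M. Maxime Gremetz.', 'Nous aussi !']
print(group_by_speaker(sample_authors, sample_items))
```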
Note 1: If authors is a long list, it will be more efficient to convert it to a set once and check membership in this set instead.
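A small sketch of what that looks like, assuming the author names match the lines exactly. The variable name author_set is just for illustration.

```python
authors = ['M. Maxime Gremetz.', 'M. le président.']

# Build the set once, outside the loop; lookups are then O(1) on average
author_set = set(authors)

is_author = 'M. le président.' in author_set   # an author line
is_speech = 'Nous aussi !' not in author_set   # a speech line
print(is_author, is_speech)
```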
Note 2: I fixed the typos in your authors list so that line in authors works. If the typos are genuine and you need to handle them, replace line in authors with list_start_match(authors, line), using the following definition of list_start_match. In this case, you won't be able to use a set for quick membership checks:
def list_start_match(lst, val):
    for elem in lst:
        if elem.startswith(val):
            return True
    return False
Here, I use .startswith to check if each element of lst starts with val. If you have different criteria for matching the author given the partial value, you can use that.
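As one example of a different criterion: the names in the question's authors list look truncated and inconsistently spaced ('M.Maxime Gremet' versus 'M. Maxime Gremetz.'). The sketch below assumes that is the only kind of mismatch. The helpers normalize and matches_author are hypothetical names of mine: they strip whitespace and case, then test whether the line starts with a listed author name. A speech line that happens to begin with an author's name would be misclassified, so treat this as a starting point only.

```python
def normalize(name):
    # Remove all whitespace and lowercase, so 'M.Maxime' equals 'M. Maxime'
    return "".join(name.split()).lower()


def matches_author(authors, line):
    key = normalize(line)
    # Drop a trailing dot from the listed name so truncated names still match
    return any(key.startswith(normalize(a).rstrip(".")) for a in authors)


authors = ['M. Maxime Gremet', 'M. le président.', 'M.Claude Goasgu']
print(matches_author(authors, 'M. Maxime Gremetz.'))
print(matches_author(authors, 'Nous aussi !'))
```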
Note 3: Your expected output is a list of dictionaries, not a dictionary. You could have a dict be the output, but then you lose the order in which people said things, since dicts can only have one value per key. This value can be a list, which allows you to fit multiple "lines" of conversation per person, but fails with a conversation that alternates between multiple people. This doesn't seem to be what you're looking for anyway, so I guess it's a moot point.
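To make the trade-off concrete, here is a rough sketch of the single-dictionary variant. The function name speeches_by_author and the toy data are made up for illustration. All lines by the same author end up in one list, so you can no longer tell that speaker A spoke both before and after speaker B.

```python
from collections import defaultdict


def speeches_by_author(authors, authors_and_words):
    by_author = defaultdict(list)
    current = None  # most recent author seen
    for line in authors_and_words:
        if line in authors:
            current = line
        elif current is not None:
            by_author[current].append(line)
    return dict(by_author)


# Toy alternating conversation: A speaks, then B, then A again
toy = ['A', 'hello', 'B', 'hi', 'A', 'bye']
merged = speeches_by_author(['A', 'B'], toy)
print(merged)
```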
Note 4: If you have control over the schema of the output, I would strongly suggest against having keys named after your authors. This makes it harder to figure out who the author is (the author = list(conversation[-1].keys())[0] shenanigans we did above).
Instead, consider changing each element of your output to something like this:
{"author": 'M. Maxime Gremetz.',
"lines": ['Nous aussi !']}
which allows you to access the author of every snippet using the "author" key. If you do decide to go with this, you'll have to modify the code like so:
conversation = [{"author": None, "lines": []}]  # Initialize with a placeholder to prevent an error in `conversation[-1]` in the first loop
for line in authors_and_words:
    # <is an author> and <prev author is different>
    if line in authors and line != conversation[-1]["author"]:
        d = {"author": line, "lines": []}  # Create an empty record for this author
        conversation.append(d)
    else:
        conversation[-1]["lines"].append(line)
result = conversation[1:]  # Discard placeholder element from result
which gives the following:
[{'author': 'M. le président.',
'lines': ["Conformément au premier alinéa de l'article 28 de la Constitution, je déclare ouverte la session ordinaire de 2003-2004.",
"Mes chers collègues, permettez-moi d'abord de vous dire combien je suis heureux de vous retrouver tous."]},
{'author': 'M. Maxime Gremetz.', 'lines': ['Nous aussi !']}]
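With this schema, downstream code can read the speaker directly. A short sketch, using a hand-written result with shortened placeholder lines rather than the real speeches:

```python
# Hand-written stand-in for the result above (speech text shortened)
result = [
    {"author": "M. le président.", "lines": ["first line", "second line"]},
    {"author": "M. Maxime Gremetz.", "lines": ["Nous aussi !"]},
]

# Speakers in the order they spoke
speakers = [record["author"] for record in result]

# Number of lines per snippet, keyed by author
line_counts = {record["author"]: len(record["lines"]) for record in result}

print(speakers)
print(line_counts)
```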
CodePudding user response:
Pranav provided a great answer to my question, but later I finalized my own solution. It's not as elegant as his, but it also works.
solution = []
items_counter = 0
authors_counter = 0
while items_counter < len(authors_and_words):
    current_author = ''
    for i in authors_and_words:
        if i in authors:
            # New speaker: start a fresh dict for their lines
            temp_dict = {}
            temp_dict[i] = []
            items_counter += 1
            current_author = i
            solution.append(temp_dict)
            authors_counter += 1
        else:
            # Speech line: attach it to the current speaker
            items_counter += 1
            temp_dict[current_author].append(i)