Capture elements between keywords, and create dict with lists of values in between keywords

Time:07-30

Sorry for the poorly worded title, it's the best I can do.

I have two lists: a list of keywords named breakpoints, and a list of string values that I pulled out of some HTML with BeautifulSoup.

from datetime import datetime

from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient('connection string')

res = client['database']['collection'].find({"created": {"$gte": datetime(2022, 1, 1)}})

soup = BeautifulSoup(list(res)[0]['text'], 'html.parser') #text stores some html code for injection into a web page

string_list = []
string_dict = {}
for string in soup.stripped_strings:
  string_list.append(string)

#string_list = ['foo', 'bar', 'data1', 'data2', 'baz', 'data3', 'data4', ... ,'eggs', 'dataX']
breakpoints = ['foo', 'bar', 'baz' ... 'eggs']

for b in range(len(breakpoints)):
  for s in range(len(string_list)):
    if breakpoints[b]==string_list[s]:
      string_dict[breakpoints[b]] = [string_list[s]:string_list.index(breakpoints[b + 1])]

The last line is invalid syntax. What I would like as the final output is:


string_dict: {
  "foo": [], #or just None or an empty string, anything to indicate 'no values'
  "bar": ['data1', 'data2']
  ...
  "eggs": ['dataX']
  }

I know I still have to throw in some logic for when it reaches the last index of the breakpoints list, but that's fairly trivial. I'm having trouble figuring out how to get just the basic logic down for constructing the lists for the dict.

Appreciate any help.

CodePudding user response:

Here is a minimal example that should solve the task the way you planned it.

string_dict = {}
string_list = ['foo', 'bar', 'data1', 'data2', 'baz', 'data3', 'data4', 'eggs', 'dataX']
breakpoints = ['foo', 'bar', 'baz', 'eggs']

if len(breakpoints) > 0:
    start_idx = string_list.index(breakpoints[0]) + 1
    for b, breakpoint in enumerate(breakpoints):
        if b + 1 < len(breakpoints):
            end_idx = string_list.index(breakpoints[b + 1])
            string_dict[breakpoint] = string_list[start_idx:end_idx]
            start_idx = end_idx + 1
        else:
            string_dict[breakpoint] = string_list[start_idx:]

This produces the following dictionary:

{
  'foo': [],
  'bar': ['data1', 'data2'],
  'baz': ['data3', 'data4'],
  'eggs': ['dataX']
}

Note that this code only works as long as all breakpoints are unique, but I guess you have already considered that. Note also that for a very long string_list, searching only a slice of the list will improve execution time: instead of scanning the whole list for every breakpoint, you search from the previous breakpoint onwards. This will of course make the code slightly more complicated.
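
The speed-up described above can be sketched with the optional start argument of list.index, which begins the search at a given position instead of at the front of the list. This is only a minimal sketch: it assumes every breakpoint occurs in string_list, in the same order as in breakpoints, and the helper name split_by_breakpoints is made up for illustration.

```python
def split_by_breakpoints(string_list, breakpoints):
    """Map each breakpoint to the values between it and the next breakpoint.

    Hypothetical helper (not from the original post). Assumes every
    breakpoint is present in string_list and that the breakpoints appear
    there in the same order as in the breakpoints list.
    """
    result = {}
    if not breakpoints:
        return result
    # Position of the first breakpoint; later searches resume from here.
    pos = string_list.index(breakpoints[0])
    for current, following in zip(breakpoints, breakpoints[1:]):
        # Start searching just past the current breakpoint, not from index 0.
        next_pos = string_list.index(following, pos + 1)
        result[current] = string_list[pos + 1:next_pos]
        pos = next_pos
    # The last breakpoint collects everything up to the end of the list.
    result[breakpoints[-1]] = string_list[pos + 1:]
    return result


string_list = ['foo', 'bar', 'data1', 'data2', 'baz', 'data3', 'data4', 'eggs', 'dataX']
breakpoints = ['foo', 'bar', 'baz', 'eggs']
result = split_by_breakpoints(string_list, breakpoints)
print(result)
```

Since each search resumes after the previous breakpoint, no part of string_list is scanned twice. A missing breakpoint still raises ValueError, just as in the version above.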

CodePudding user response:

from collections import defaultdict
string_list = ['foo', 'bar', 'data1', 'data2', 'baz', 'data3', 'data4', 'eggs', 'dataX']
breakpoints = ['foo', 'bar', 'baz','eggs']
d = defaultdict(list)
for i in breakpoints:
    d[i]
for key in d.keys():
    key_ind=string_list.index(key)
    for ind in range(key_ind + 1, len(string_list)):
        if string_list[ind] in d.keys():
            break
        else:
            d[key].append(string_list[ind])
    
print(d)

output:

defaultdict(<class 'list'>, {'foo': [], 'bar': ['data1', 'data2'], 'baz': ['data3', 'data4'], 'eggs': ['dataX']})

We can achieve this with a defaultdict: create the dictionary keys up front, then look up each key's index in string_list and loop over the items that follow it until the next key is reached.
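
A related variant, not taken from this answer, walks string_list only once and remembers which keyword was seen last. The sketch below swaps the defaultdict for a plain dict and uses a made-up function name, group_after_keywords. It assumes that anything appearing before the first keyword can be ignored.

```python
def group_after_keywords(string_list, breakpoints):
    """Collect, in one pass, the items that follow each keyword.

    Hypothetical helper for illustration. Items before the first keyword
    are skipped; a keyword that appears twice keeps adding to its list.
    """
    keywords = set(breakpoints)                 # fast membership test
    groups = {key: [] for key in breakpoints}   # every keyword gets a list
    current = None                              # keyword seen most recently
    for item in string_list:
        if item in keywords:
            current = item
        elif current is not None:
            groups[current].append(item)
    return groups


string_list = ['foo', 'bar', 'data1', 'data2', 'baz', 'data3', 'data4', 'eggs', 'dataX']
breakpoints = ['foo', 'bar', 'baz', 'eggs']
groups = group_after_keywords(string_list, breakpoints)
print(groups)
```

Unlike the index-based versions, this one does not raise an error when a keyword is missing from string_list; that keyword simply keeps an empty list.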

CodePudding user response:

IIUC, this oneliner should work for you:

string_list = ['foo', 'bar', 'data1', 'data2', 'baz', 'data3', 'data4','eggs', 'dataX']
breakpoints = ['foo', 'bar', 'baz', 'eggs']

{string_list[i[0]:i[1]][0]: string_list[i[0]:i[1]][1:] for i in list(zip([0] + [string_list.index(x) for x in breakpoints], [string_list.index(x) for x in breakpoints] + [len(string_list)]))[1:]}

Output:

{'foo': [],
 'bar': ['data1', 'data2'],
 'baz': ['data3', 'data4'],
 'eggs': ['dataX']}
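
For readability, the same pairing of start and end positions can be spelled out in a few steps. This is a sketch of the idea rather than a line-by-line translation of the one-liner; the names starts, ends and segments are made up for illustration, and it assumes each keyword occurs exactly once in string_list.

```python
string_list = ['foo', 'bar', 'data1', 'data2', 'baz', 'data3', 'data4', 'eggs', 'dataX']
breakpoints = ['foo', 'bar', 'baz', 'eggs']

# Index of each keyword in string_list (assumes each keyword occurs once).
starts = [string_list.index(key) for key in breakpoints]
# Each segment ends where the next keyword starts; the last runs to the end.
ends = starts[1:] + [len(string_list)]

# Key: the keyword itself. Value: everything after it, up to the segment end.
segments = {
    string_list[start]: string_list[start + 1:end]
    for start, end in zip(starts, ends)
}
print(segments)
```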