Is there a better way to create index for headings and sub-headings of a document in python?

Time:03-28

I have a list of headings and subheadings of a document.

test_list = ['heading', 'heading','sub-heading', 'sub-heading', 'heading', 'sub-heading', 'sub-sub-heading', 'sub-sub-heading', 'sub-heading', 'sub-heading', 'sub-sub-heading', 'sub-sub-heading','sub-sub-heading', 'heading']

I want to assign a unique index to each heading and subheading, as follows:

seg_ids = ['1', '2', '2_1', '2_2', '3', '3_1', '3_1_1', '3_1_2', '3_2', '3_3', '3_3_1', '3_3_2', '3_3_3', '4']

This is my code to create this result, but it is messy and restricted to depth 3. A document with a sub-sub-sub-heading would make the code even more complicated. Is there a Pythonic way to do this?

seg_ids = []
for idx, an_ele in enumerate(test_list):
    
    head_id = 0
    subh_id = 0
    subsubh_id = 0
    if an_ele == 'heading' and idx == 0:  # if it is the first element 
        head_id = '1'
        seg_ids.append(head_id)
        
        
    else:
        last_seg_ids = seg_ids[idx-1].split('_')  # find the depth of the last element
        head_id = last_seg_ids[0]
        
        if len(last_seg_ids) == 2:  
            subh_id = last_seg_ids[1]
        elif len(last_seg_ids) == 3:
            subh_id = last_seg_ids[1]
            subsubh_id = last_seg_ids[2]
            
           
        if an_ele == 'heading':
            head_id = str(int(head_id) + 1)
            subh_id = 0  # reset sub_heading index
            subsubh_id = 0  # reset sub_sub_heading index

        elif an_ele == 'sub-heading':
            subh_id = str(int(subh_id) + 1)
            subsubh_id = 0  # reset sub_sub_heading index
        elif an_ele == 'sub-sub-heading':
            subsubh_id = str(int(subsubh_id) + 1)
        else:
            print('ERROR')
            
        
        if subsubh_id==0:
            if subh_id !=0:
                seg_ids.append(head_id + '_' + subh_id)

            else:
                seg_ids.append(head_id)

        if subsubh_id !=0:
            seg_ids.append(str(head_id) + '_' + str(subh_id) + '_' + str(subsubh_id))
            
          
            
print(seg_ids)        

CodePudding user response:

def get_level(s):
    return s.count('-')

def translate(test_list):
    seg_ids = []
    levels = [0]*9
    last_level = 99
    for an_ele in test_list:
        level = get_level(an_ele)
        if level <= last_level:
            levels[level] += 1
        else:
            levels[level] = 1
        seg_ids.append('_'.join(str(k) for k in levels[:level + 1]))
        last_level = level
    return seg_ids

print(translate(['heading', 'heading','sub-heading', 'sub-heading', 'heading', 'sub-heading', 'sub-sub-heading', 'sub-sub-heading', 'sub-heading', 'sub-heading', 'sub-sub-heading', 'sub-sub-heading','sub-sub-heading', 'heading']))

Output:

['1', '2', '2_1', '2_2', '3', '3_1', '3_1_1', '3_1_2', '3_2', '3_3', '3_3_1', '3_3_2', '3_3_3', '4']

This fixes the maximum number of levels at 9. You could extend that by setting levels=[0] and then extending it if the new level was beyond the end, but this gets the point across.
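The extension mentioned above could look roughly like the following sketch, which grows the counter list on demand instead of preallocating nine slots. The function name number_headings and the choice to raise ValueError when a level is skipped are assumptions made for illustration, not part of the answer above.

```python
def number_headings(labels):
    """Return ids such as '3_1_2' for labels such as 'sub-sub-heading'."""
    counters = []  # counters[d] is the running count at depth d
    seg_ids = []
    for label in labels:
        depth = label.count('-')  # 'heading' -> 0, 'sub-heading' -> 1, ...
        if depth > len(counters):
            # Assumption: a heading may not skip over its parent level.
            raise ValueError(f"{label!r} skips a level")
        # Forget counters that belong to deeper levels of the previous branch.
        del counters[depth + 1:]
        if depth == len(counters):
            counters.append(1)    # first heading at a new, deeper level
        else:
            counters[depth] += 1  # next sibling at an existing level
        seg_ids.append('_'.join(map(str, counters)))
    return seg_ids
```

Because the list only ever holds the counters on the current path, the depth is limited by the input rather than by a constant.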

CodePudding user response:

You may use the split('-') method to find the level of the heading:

subs_amount = an_ele.split('-')

You can deduce the level of the heading from the length of the subs_amount list. If the length is 1, it is a "heading"; if it is 3, it is a "sub-sub-heading", and so on. Then keep a list store_levels, initially empty, that stores the current index at each level from the top-level heading down to the current one, like Tim Roberts says in their comment:

if len(subs_amount) > len(store_levels):
    store_levels.append(1)  # add a sub-level
elif len(subs_amount) == len(store_levels):
    store_levels[-1] += 1  # add a heading of the same level
else:
    del store_levels[len(subs_amount):]  # go back up to the new heading's level
    store_levels[-1] += 1  # and count the new heading at that level

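Now, to build your output, you just have to call "_".join(map(str, store_levels)) and append the result to the output. The counters are integers, so they must be converted to strings before joining.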
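Put together, the idea might be sketched as below. The wrapper function build_ids and the list name output are assumptions added for illustration. Note that when moving back up, this sketch trims store_levels to the new heading's level and then increments the last counter, so a jump from a sub-sub-heading straight to a heading is handled too.

```python
def build_ids(test_list):
    store_levels = []  # one counter per level on the current path
    output = []
    for an_ele in test_list:
        subs_amount = an_ele.split('-')
        depth = len(subs_amount)  # 1 for 'heading', 2 for 'sub-heading', ...
        if depth > len(store_levels):
            store_levels.append(1)    # add a sub-level
        elif depth == len(store_levels):
            store_levels[-1] += 1     # add a heading of the same level
        else:
            del store_levels[depth:]  # go back up, possibly several levels
            store_levels[-1] += 1     # next heading at that higher level
        output.append('_'.join(map(str, store_levels)))
    return output
```

It assumes the input never skips a level on the way down (for example, a 'heading' followed directly by a 'sub-sub-heading').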


Sorry for not using the same variable names as you. I did so not to confuse or change their use. I hope my code is clear enough so you can implement it to yours.
