Find start and end indexes of list of substrings

Time:10-25

Given a sentence and a list of dates:

  • sentence = 'Foo bar was open on 12.03.2022 and closed on 3.05.22.'
  • list = ['12.03.2022', '4.04.2022', '3.05.22']

For every date in the list that occurs in the sentence, I want its start and end indices in the sentence as a tuple.

In this case: [(20, 29), (45, 51)] (the end index is inclusive)

I have found the dates through regex but I cannot get the indices.

import re

DAY = r'(?:(?:0)[1-9]|[12]\d|3[01])'  # day can be from 1 to 31 with a leading zero
MONTH = r'(?:(?:0)[1-9]|1[0-2])' # month can be 1 to 12 with a leading zero
YEAR1 = r'(?:(?:20|)\d{2}|(?:19|){9}[0-9])'  # Restricted the year to begin in 20th or 21st century 
                            # Also the first two digits may be skipped if data is represented as dd.mm.yy  
YEAR2 = r'(?:20\d{2}|199[0-9])'
                     
BEGIN_LINE1 = r'(?<!\w)'
DELIM1 = r'(?:[\,\/\-\._])' 
DELIM2 = r'(?:[\,\/\-\._])?'

# combined, several options
NUM_DATE =  f"""(?P<date>
    (?:
        # DAY MONTH YEAR
        (?:{BEGIN_LINE1}{DAY}{DELIM1}{MONTH}{DELIM1}{YEAR1})
        |
        (?:{BEGIN_LINE1}{DAY}{DELIM1}{MONTH})
        |
        (?:{BEGIN_LINE1}{MONTH}{DELIM1}{YEAR1})
        |
        (?:{BEGIN_LINE1}{DAY}{DELIM2}{MONTH}{DELIM2}{YEAR2})
        |
        (?:{BEGIN_LINE1}{MONTH}{DELIM2}{YEAR2})
    )
)"""


myDate = re.compile(NUM_DATE, re.IGNORECASE | re.VERBOSE | re.UNICODE)


def find_date(subject):
    """_summary_

    Args:
        subject (_type_): _description_

    Returns:
        _type_: _description_
    """
    
    if subject is None:
        return subject
    
    dates = list(set(myDate.findall(subject)))
    
    
    return dates
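
findall only returns the matched strings, so the positions are lost. One way to keep them is to iterate over match objects with finditer and read start() and end() from each. A minimal sketch, assuming a simplified stand-in pattern instead of the full one above (SIMPLE_DATE and find_date_spans are hypothetical names) and inclusive end indices as in the expected output:

```python
import re

# Simplified stand-in for the full pattern above
# (assumption: only d.m.yy and d.m.yyyy with dots as delimiters).
SIMPLE_DATE = re.compile(r'(?<!\w)\d{1,2}\.\d{1,2}\.(?:\d{4}|\d{2})(?!\w)')


def find_date_spans(subject, wanted):
    """Return (start, inclusive_end) for each match whose text is in `wanted`."""
    wanted = set(wanted)
    return [
        (m.start(), m.end() - 1)  # m.end() is exclusive, so subtract 1
        for m in SIMPLE_DATE.finditer(subject)
        if m.group() in wanted
    ]


sent = 'Foo bar was open on 12.03.2022 and closed on 3.05.22.'
dates = ['12.03.2022', '4.04.2022', '3.05.22']
print(find_date_spans(sent, dates))  # [(20, 29), (45, 51)]
```

The same loop should work with the compiled myDate pattern, since finditer is available on any compiled pattern.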

CodePudding user response:

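Use re.search, escaping each date so that the dots are matched literally:

import re

sent = 'Foo bar was open on 12.03.2022 and closed on 3.05.22.'
date_list = ['12.03.2022', '4.04.2022', '3.05.22']

# re.escape keeps '.' from acting as a regex wildcard
matches = (re.search(re.escape(date), sent) for date in date_list)
hits = [m for m in matches if m]
# indices of first match:
hits[0].span()  # (20, 30)

hits[0].span() will give you the indices (the end index is exclusive, so subtract 1 for an inclusive end) and hits[0].group() the matched substring.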
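
Because the dates contain dots, and `.` matches any character in a regular expression, it matters whether a date string is escaped before it is passed to re.search. A small illustration, using a made-up sentence (the text below is hypothetical, not from the question):

```python
import re

text = 'ref 3105122 closed on 3.05.22.'
date = '3.05.22'

# Unescaped: each '.' matches any character, so the digits '3105122' match first.
loose = re.search(date, text)
# Escaped: each '.' must be a literal dot.
strict = re.search(re.escape(date), text)

print(loose.group(), loose.span())    # 3105122 (4, 11)
print(strict.group(), strict.span())  # 3.05.22 (22, 29)
```

For the sentence in the question both variants happen to return the same spans, so the difference only shows up on other inputs.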

CodePudding user response:

Using a regular for loop:

sent = 'Foo bar was open on 12.03.2022 and closed on 3.05.22.'
date_list = ['12.03.2022', '4.04.2022', '3.05.22']  # renamed so it does not shadow the built-in list
tup = []
for date in date_list:
    if date in sent:
        start_index = sent.index(date)
        end_index = start_index + len(date) - 1  # inclusive end
        tup.append((start_index, end_index))

Using a list comprehension:

tup = [(sent.index(d), sent.index(d) + len(d) - 1) for d in date_list if d in sent]
print(tup)

Output: [(20, 29), (45, 51)]
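
Note that str.index only reports the first occurrence, so a date that appears twice in the sentence would yield a single tuple. If every occurrence is needed, a loop over str.find with a moving start offset is one option. A rough sketch (all_spans is a hypothetical helper name, and the second sentence is invented for the demonstration):

```python
def all_spans(text, needles):
    """Return (start, inclusive_end) for every occurrence of every needle."""
    spans = []
    for needle in needles:
        start = text.find(needle)
        while start != -1:
            spans.append((start, start + len(needle) - 1))
            # Continue searching just past the start of the last hit.
            start = text.find(needle, start + 1)
    return sorted(spans)


dates = ['12.03.2022', '4.04.2022', '3.05.22']

sent = 'Foo bar was open on 12.03.2022 and closed on 3.05.22.'
print(all_spans(sent, dates))      # [(20, 29), (45, 51)]

repeated = 'Open 12.03.2022, closed 3.05.22, reopened 12.03.2022.'
print(all_spans(repeated, dates))  # [(5, 14), (24, 30), (42, 51)]
```

Like the `in` check above, this is plain substring matching, so '3.05.22' would also be found inside '13.05.22'. If that matters, a regex with word-boundary checks is probably the safer route.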