Finding date in a list of strings

Time:12-11

I have a list of strings with different values, and I'm trying to find the strings in the list that are dates and return the index of each date. I tried using the dateutil parser like this:

x = ["test", "Hello", "abc", "27.02.2020"]
for item in x:
    if parse(item) == True:
        print(x.index(item))

This does not work since most of the strings in my list are not dates, and the parser fails on strings that are not dates. Does anyone have a solution for how I could solve this differently?

CodePudding user response:

There are many ways to do it. One of the easiest is to use pandas.to_datetime(), which raises an exception (a ValueError, such as ParserError) if the string is not a date:

import pandas as pd #pip install pandas


def is_date(date_string):
    try:
        pd.to_datetime(date_string, format='%d.%m.%Y')
        return True
    except Exception:
        return False
    
x = ["test", "Hello", "abc", "27.02.2020"]
for index, item in enumerate(x):
    if is_date(item):
        print(index)

CodePudding user response:

The simplest solution is to use a regular expression:

import re

x = ["test", "Hello", "abc", "27.02.2020"]
for item in x:
    if re.fullmatch(r'[0-9]{2}\.[0-9]{2}\.[0-9]{4}', item):
        print(item, 'is a date')

Note that this only checks the shape of the string. It cannot validate the date itself, so it would accept the 32nd of December even though that is not a valid date.
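One way to work around the validation gap, sketched here as an assumption rather than as part of the answer above, is to use the regular expression as a quick filter and then let datetime.strptime confirm that the day and month actually exist. The helper name is_valid_date and the extra "32.12.2020" entry are made up for illustration:

```python
import re
from datetime import datetime

# Same DD.MM.YYYY shape as the pattern above, compiled once.
DATE_PATTERN = re.compile(r"[0-9]{2}\.[0-9]{2}\.[0-9]{4}")

def is_valid_date(text):
    """Return True if text has the DD.MM.YYYY shape and is a real calendar date."""
    if not DATE_PATTERN.fullmatch(text):
        return False
    try:
        # strptime raises ValueError for impossible dates such as 32.12.2020.
        datetime.strptime(text, "%d.%m.%Y")
    except ValueError:
        return False
    return True

# "32.12.2020" is a hypothetical extra entry that matches the pattern but is not a date.
x = ["test", "Hello", "abc", "27.02.2020", "32.12.2020"]
date_indices = [i for i, item in enumerate(x) if is_valid_date(item)]
print(date_indices)  # [3]
```

This also returns the indexes the question asks for, rather than the matching strings.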

CodePudding user response:

There are different approaches for this, including the pattern matching proposed above. If all the dates in your list have the same format, you could write a helper function that tries to parse the string with that format and, if a ValueError is raised because it is not a date, returns None or whatever you prefer:

from datetime import datetime as dt

def try_parse(x, date_format="%d.%m.%Y"):
    try:
        return dt.strptime(x, date_format)
    except ValueError:
        return None

lst = ["test", "Hello", "abc", "27.02.2020"]

[try_parse(x) for x in lst]

OUTPUT

[None, None, None, datetime.datetime(2020, 2, 27, 0, 0)]

The advantage of this is that the only library needed is datetime, which is part of the standard library. In addition, you could make it more robust by passing a list of expected date formats and trying each of them, so you are not limited to just %d.%m.%Y, and defaulting to something if no parsing is successful, as in the solution above.
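A minimal sketch of that multi-format idea could look like the following. The name try_parse_any, the DATE_FORMATS tuple and the extra "2020-02-27" entry are assumptions for illustration, so adjust the formats to whatever you actually expect in your data:

```python
from datetime import datetime

# Hypothetical set of accepted formats; they are tried in this order.
DATE_FORMATS = ("%d.%m.%Y", "%Y-%m-%d", "%d/%m/%Y")

def try_parse_any(text, formats=DATE_FORMATS, default=None):
    """Return the first successful parse of text, or default if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    return default

lst = ["test", "Hello", "abc", "27.02.2020", "2020-02-27"]
parsed = [try_parse_any(s) for s in lst]

# Indexes of the entries that parsed as dates.
date_indices = [i for i, d in enumerate(parsed) if d is not None]
print(date_indices)  # [3, 4]
```

The order of the formats matters when a string could match more than one of them, since the first successful parse wins.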

I think this is essentially what pandas.to_datetime does, so you could simply do:

import pandas as pd

[pd.to_datetime(y, errors="coerce") for y in lst]

OUTPUT

[NaT, NaT, NaT, Timestamp('2020-02-27 00:00:00')]

The errors="coerce" option makes to_datetime return NaT when parsing fails.
