Data Cleansing Dates in Python-CodePudding

I am processing a .csv in Python which has a column full of timestamps in the below format:-

August 21st 2020, 13:58:19.000

The full content of a line in the .csv looks like this:-

"August 21st 2020, 14:55:12.000","[email protected]",IE,Ireland,"192.168.0.1",2

What I wish to do is populate a separate column with the respective date and time and so I'd have '2020-08-21' in a new column and '13:58:19' in an additional new column.

Looking at this, I think I need to firstly tidy up the date and remove the letters after the actual date (21st, 4th, 7th etc) so that they read 21, 4 etc...

I started writing the below but this will replace the actual numbers too. How can I just strip out the letters after the numbers? I did think of a simple find/replace but this would remove 'st' from August for example hence a regex is needed I think to look for a space following by 1 or 2 numbers and then check if it is followed by ['nd','st','rd','th'] and if so, strip this out:-

with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
    for line in lines:
        line = re.sub(r"\s\d ", "", line)
        print(line)

Based on a suggestion to use the datetime module, I tried splitting to see if this helped:-

import re from datetime import datetime

with open('input_csv.csv', 'r') as input_file:
    lines = input_file.readlines()[1:]
    for line in lines:
        line_lst = line.split(',')
        line_date = datetime.strptime(line_lst[0], '%B %d %y')
        line_time = datetime.strptime(line_lst[1], '%H:%M:%S.%f')
        print(line_date)
        print(line_time)

but I still receive an error:-

ValueError: time data '"May 12th 2020' does not match format '%B %d %y'

CodePudding user response：

It's probably easier to use the datetime library. For example,

from datetime import datetime

with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
    for line in lines:
        line_date = datetime.strptime(line, '%B %d %y, %H:%M:%S.%f')

and then use strftime() to write it to the new file in the new format. To read more about these methods:

https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

CodePudding user response：

I recommend using as many good libraries as you can for this: a lot of what you want to do has already been developed and tested.

Use a CSV parser

Parsing a CSV can look simple at first but become increasinly difficult as there are a lot of special rules around quoting (escaping) characters. Use Python's stand CSV module to do this:

import csv

with open('input.csv', newline='') as f:
    reader = csv.reader(csv_file)
    for row in reader:
        date_val = row[0]
        print(f'Raw string: {date_val}')

and I get the following without any special code:

Raw string: August 21st 2020, 14:55:12.000

Use a really good date-time parser

As a couple of others already pointed out, it's important to use a date-time parser, but they didn't tell you that Python's standard parser cannot handle the ordinals like '12th' and '21st' (maybe they just didn't see those...). The 3rd-party dateutil lib can handle them, though:

from dateutil import parser
...
...
        print(f'Raw string: {date_val}')

        dt = parser.parse(date_val)
        print(f'Date: {dt.date()}')
        print(f'Time: {dt.time()}')

Date: 2020-08-21
Time: 14:55:12

Make a new a CSV

Again, use the CSV module's writer class to take the modified rows from the reader and turn them into a CSV.

For simplicity, I recommend just accumulating new rows as you're reading and parsing the input. I've also included commented-out lines for copying a header, if you have one:

...
new_rows = []
with open('input.csv', newline='') as f:
    reader = csv.reader(f)

    # header = next(reader)
    # new_rows.append(header)

    for row in reader:
        date_val = row[0]
        remainder = row[1:]

        dt = parser.parse(date_val)

        new_row = [dt.date(), dt.time()]
        new_row  = remainder

        new_rows.append(new_row)


with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(new_rows)

My output.csv looks like this:

2020-08-21,14:55:12,[email protected],IE,Ireland,192.168.0.1,2