I am given a dataset (student number, first name, last name, date of birth, study program) and have to write a program that processes it and puts each record in one of two lists: valid data or corrupted data. Data values are sometimes corrupted, and the program must report them. Any invalid or empty value counts as corrupted.
The student number has this format: 7 digits, starting with 0, where the second digit (from the left) is either 8 or 9. Example: 0212345 is not valid.
First and last names contain only letters. The date of birth has the format YYYY-MM-DD, with days between 1 and 31, months between 1 and 12, and years between 1960 and 2004.
The study program can have one of these values: INF, TINF, CMD, AI.
I also have a CSV file with the dataset, which looks like this:
0893527,Ruggiero,Fifield,1976-08-18,DS
0944991,Vanny,Jerromes,1996-08-10,TINF
0959490,Abbe,Trees,1986-11-29,DS
This is obviously not the entire list, but the rest looks exactly the same.
I really need help with this since I'm getting nowhere. Any help and/or tips are appreciated.
This is the code that I have so far:
import os
import sys
valid_lines = []
corrupt_lines = []
def validate_data(line):
    pass
def main(csv_file):
    with open(os.path.join(sys.path[0], csv_file), newline='').readlines() as csv_file:
        next(csv_file)
        for line in csv_file:
            validate_data(line.strip())
        for digits in csv_file:
            if csv_file[1] != (8,9):
                print('')
    print('### VALID LINES ###')
    print("\n".join(valid_lines))
    print('### CORRUPT LINES ###')
    print("\n".join(corrupt_lines))
if __name__ == "__main__":
    main('students.csv')
Answer:
You can use the re module to validate the student number and the names, str.split to break the date into its parts, and a set to check the study program:
import re
import csv

valid, corrupted = [], []

# 7 digits: a leading 0, then 8 or 9, then five more digits
pat_number = re.compile(r"^0[89]\d{5}$")
# one or more letters, nothing else
pat_names = re.compile(r"^[a-zA-Z]+$")
valid_programs = {"INF", "TINF", "CMD", "AI"}

with open("your_data.csv", "r", newline="") as f_in:
    reader = csv.reader(f_in)
    for row in reader:
        # a row with missing or extra columns cannot be unpacked below
        if len(row) != 5:
            print(f"{row=} invalid")
            corrupted.append(row)
            continue

        number, first_name, last_name, date, program = row

        match = pat_number.search(number)
        if not match:
            print(f"{number=} invalid")
            corrupted.append(row)
            continue

        match = pat_names.search(first_name)
        if not match:
            print(f"{first_name=} invalid")
            corrupted.append(row)
            continue

        match = pat_names.search(last_name)
        if not match:
            print(f"{last_name=} invalid")
            corrupted.append(row)
            continue

        try:
            y, m, d = map(int, date.split("-"))
            if y < 1960 or y > 2004:
                print(f"{y=} invalid")
                corrupted.append(row)
                continue
            if m < 1 or m > 12:
                print(f"{m=} invalid")
                corrupted.append(row)
                continue
            if d < 1 or d > 31:
                print(f"{d=} invalid")
                corrupted.append(row)
                continue
        except ValueError:
            # raised for non-numeric parts or a wrong number of parts
            print(f"{date=} invalid")
            corrupted.append(row)
            continue

        if program not in valid_programs:
            print(f"{program=} invalid")
            corrupted.append(row)
            continue

        valid.append(row)

print(f"{valid=}")
print("-" * 80)
print(f"{corrupted=}")
Prints:
program='DS' invalid
program='DS' invalid
valid=[['0944991', 'Vanny', 'Jerromes', '1996-08-10', 'TINF']]
--------------------------------------------------------------------------------
corrupted=[['0893527', 'Ruggiero', 'Fifield', '1976-08-18', 'DS'], ['0959490', 'Abbe', 'Trees', '1986-11-29', 'DS']]
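If you would rather keep the validate_data(line) skeleton from the question, the same checks can be folded into a function that returns True or False. This is only a minimal sketch under a few assumptions: the names STUDENT_NUMBER, NAME, DATE, PROGRAMS and is_valid_date are made up for illustration, the date is required to be zero-padded (so 1996-8-10 is rejected, which is stricter than the str.split version above), and lines are split on plain commas, which is fine only as long as no field contains a quoted comma.

```python
import re

# Patterns and allowed values follow the rules stated in the question.
# All names here are illustrative, not part of the original code.
STUDENT_NUMBER = re.compile(r"0[89][0-9]{5}")
NAME = re.compile(r"[A-Za-z]+")
DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")
PROGRAMS = {"INF", "TINF", "CMD", "AI"}


def is_valid_date(text):
    """Check YYYY-MM-DD against the year, month and day ranges from the question."""
    match = DATE.fullmatch(text)
    if not match:
        return False
    year, month, day = map(int, match.groups())
    return 1960 <= year <= 2004 and 1 <= month <= 12 and 1 <= day <= 31


def validate_data(line):
    """Return True if a raw CSV line passes every rule, False otherwise."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 5:
        return False  # missing or extra columns count as corrupted
    number, first_name, last_name, birth_date, program = fields
    return (
        STUDENT_NUMBER.fullmatch(number) is not None
        and NAME.fullmatch(first_name) is not None
        and NAME.fullmatch(last_name) is not None
        and is_valid_date(birth_date)
        and program in PROGRAMS
    )


valid_lines, corrupt_lines = [], []
sample = [
    "0893527,Ruggiero,Fifield,1976-08-18,DS",
    "0944991,Vanny,Jerromes,1996-08-10,TINF",
    "0959490,Abbe,Trees,1986-11-29,DS",
]
for line in sample:
    # sort each line into the list that matches its validation result
    (valid_lines if validate_data(line) else corrupt_lines).append(line)

print("### VALID LINES ###")
print("\n".join(valid_lines))
print("### CORRUPT LINES ###")
print("\n".join(corrupt_lines))
```

In main you would then loop over the lines of the opened file instead of the sample list. Note that the day check only enforces the 1 to 31 range from the assignment, so a date like 2001-02-31 still passes.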