I'm very new to Python and don't know how to split a dataset into two different lists

So I am given a dataset (student number, first name, last name, date of birth, study program), and I have to write a program that processes this data and puts each record in one of two lists: valid data or corrupted data. Data values are sometimes corrupted, and the program must report the corrupted values. Any invalid or empty value counts as corrupted.

Student number has this format: 7 digits, starting with 0, and the second digit (from the left) can be either 9 or 8. Example: 0212345 is not valid.
First and last names contain only letters of the alphabet. Date of birth has this format: YYYY-MM-DD, with days between 1 and 31, months between 1 and 12, and years between 1960 and 2004.
Study program can have one of these values: INF, TINF, CMD, AI.

I also have a csv file with the dataset which looks like this:

0893527,Ruggiero,Fifield,1976-08-18,DS
0944991,Vanny,Jerromes,1996-08-10,TINF
0959490,Abbe,Trees,1986-11-29,DS

This obviously is not the entire list, but the rest looks exactly the same.

I really need help with this since I'm getting nowhere. Any help and/or tips are appreciated

This is the code I have so far:

import os
import sys

valid_lines = []
corrupt_lines = []



def validate_data(line):
    pass

def main(csv_file):
    with open(os.path.join(sys.path[0], csv_file), newline='').readlines() as csv_file:

        next(csv_file)

        for line in csv_file:
            validate_data(line.strip())
            for digits in csv_file:
               if csv_file[1] != (8,9):
                   print('')



    print('### VALID LINES ###')
    print("\n".join(valid_lines))
    print('### CORRUPT LINES ###')
    print("\n".join(corrupt_lines))


if __name__ == "__main__":    
    main('students.csv')

CodePudding user response:

You can use the re module to validate the student number and the names. For the date you can use str.split, and for the study program you can check membership in a set:

import re
import csv

valid, corrupted = [], []

pat_number = re.compile(r"^0[89]\d{5}$")
pat_names = re.compile(r"^[a-zA-Z]+$")

valid_programs = {"INF", "TINF", "CMD", "AI"}

with open("your_data.csv", "r", newline="") as f_in:
    reader = csv.reader(f_in)
    for row in reader:
        number, first_name, last_name, date, program = row

        match = pat_number.search(number)
        if not match:
            print(f"{number=} invalid")
            corrupted.append(row)
            continue

        match = pat_names.search(first_name)
        if not match:
            print(f"{first_name=} invalid")
            corrupted.append(row)
            continue

        match = pat_names.search(last_name)
        if not match:
            print(f"{last_name=} invalid")
            corrupted.append(row)
            continue

        try:
            y, m, d = map(int, date.split("-"))

            if y < 1960 or y > 2004:
                print(f"{y=} invalid")
                corrupted.append(row)
                continue

            if m < 1 or m > 12:
                print(f"{m=} invalid")
                corrupted.append(row)
                continue

            if d < 1 or d > 31:
                print(f"{d=} invalid")
                corrupted.append(row)
                continue
        except ValueError:  # a part is not an integer, or there are not exactly three parts
            print(f"{date=} invalid")
            corrupted.append(row)
            continue

        if program not in valid_programs:
            print(f"{program=} invalid")
            corrupted.append(row)
            continue

        valid.append(row)

print(f"{valid=}")
print("-" * 80)
print(f"{corrupted=}")

Prints:

program='DS' invalid
program='DS' invalid
valid=[['0944991', 'Vanny', 'Jerromes', '1996-08-10', 'TINF']]
--------------------------------------------------------------------------------
corrupted=[['0893527', 'Ruggiero', 'Fifield', '1976-08-18', 'DS'], ['0959490', 'Abbe', 'Trees', '1986-11-29', 'DS']]
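If you would rather keep the validate_data(line) skeleton from the question, the same checks can be folded into one function. This is only a sketch under a few assumptions: find_problems and the upper-case constants are names made up here, the date is matched with a regex so that something like 1996-8-1 is rejected for not being YYYY-MM-DD, and the third sample line is invented from the 0212345 example in the question.

```python
import re

# Hypothetical names; the rules themselves come from the question.
NUMBER_RE = re.compile(r"0[89]\d{5}")           # 7 digits: 0, then 8 or 9, then 5 more
NAME_RE = re.compile(r"[A-Za-z]+")              # letters only, at least one
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")  # strict YYYY-MM-DD shape
VALID_PROGRAMS = {"INF", "TINF", "CMD", "AI"}

valid_lines = []
corrupt_lines = []


def find_problems(line):
    """Return the names of the corrupted fields in one CSV line (empty list if valid)."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 5:
        return ["field count"]
    number, first_name, last_name, date, program = fields

    problems = []
    if not NUMBER_RE.fullmatch(number):
        problems.append("student number")
    if not NAME_RE.fullmatch(first_name):
        problems.append("first name")
    if not NAME_RE.fullmatch(last_name):
        problems.append("last name")

    date_match = DATE_RE.fullmatch(date)
    if date_match:
        year, month, day = map(int, date_match.groups())
        # Range checks exactly as stated in the question (no calendar check).
        date_ok = 1960 <= year <= 2004 and 1 <= month <= 12 and 1 <= day <= 31
    else:
        date_ok = False
    if not date_ok:
        problems.append("date of birth")

    if program not in VALID_PROGRAMS:
        problems.append("study program")
    return problems


def validate_data(line):
    """Append the line to valid_lines or corrupt_lines, reporting what is wrong."""
    problems = find_problems(line)
    if problems:
        corrupt_lines.append(f"{line} => corrupted: {', '.join(problems)}")
    else:
        valid_lines.append(line)


samples = [
    "0893527,Ruggiero,Fifield,1976-08-18,DS",
    "0944991,Vanny,Jerromes,1996-08-10,TINF",
    "0212345,Abbe,Trees,1986-11-29,INF",  # invented line with an invalid number
]
for sample in samples:
    validate_data(sample)

print("### VALID LINES ###")
print("\n".join(valid_lines))
print("### CORRUPT LINES ###")
print("\n".join(corrupt_lines))
```

Unlike the loop above, which stops at the first bad field, this version collects every corrupted field in a line, which may be closer to "the program must report corrupted values". Whether an impossible date such as 1990-02-31 should also count as corrupted is not stated in the question, so it is accepted here.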