I'm very new to Python and don't know how to put a dataset into two different lists

Time:12-22

I am given a dataset (student number, first name, last name, date of birth, study program) and have to write a program that processes each record and puts it in one of two lists: valid data or corrupted data. Some values are corrupted, and the program must report them. Any invalid or empty value counts as corrupted.

  • Student number: 7 digits, starting with 0; the second digit (from the left) must be either 9 or 8. Example: 0212345 is not valid.

  • First name and last name: contain only alphabetic characters.

  • Date of birth: format YYYY-MM-DD, with days between 1 and 31, months between 1 and 12, and years between 1960 and 2004.

  • Study program can have one of these values: INF, TINF, CMD, AI.

I also have a CSV file with the dataset, which looks like this:

0893527,Ruggiero,Fifield,1976-08-18,DS
0944991,Vanny,Jerromes,1996-08-10,TINF
0959490,Abbe,Trees,1986-11-29,DS

This is obviously not the entire file, but the rest has the same layout.

I really need help with this since I'm getting nowhere. Any help and/or tips are appreciated.

This is the code I have so far:

import os
import sys

valid_lines = []
corrupt_lines = []



def validate_data(line):
    pass

def main(csv_file):
    with open(os.path.join(sys.path[0], csv_file), newline='').readlines() as csv_file:

        next(csv_file)

        for line in csv_file:
            validate_data(line.strip())
            for digits in csv_file:
               if csv_file[1] != (8,9):
                   print('')



    print('### VALID LINES ###')
    print("\n".join(valid_lines))
    print('### CORRUPT LINES ###')
    print("\n".join(corrupt_lines))


if __name__ == "__main__":    
    main('students.csv')

CodePudding user response:

You can use the re module to validate the student number and the names. For the date you can use str.split and check each part as an integer. For the study program you can test membership in a set of allowed values:

import re
import csv

valid, corrupted = [], []

pat_number = re.compile(r"^0[89]\d{5}$")
pat_names = re.compile(r"^[a-zA-Z]+$")

valid_programs = {"INF", "TINF", "CMD", "AI"}

with open("your_data.csv", "r", newline="") as f_in:
    reader = csv.reader(f_in)
    for row in reader:
        number, first_name, last_name, date, program = row

        match = pat_number.search(number)
        if not match:
            print(f"{number=} invalid")
            corrupted.append(row)
            continue

        match = pat_names.search(first_name)
        if not match:
            print(f"{first_name=} invalid")
            corrupted.append(row)
            continue

        match = pat_names.search(last_name)
        if not match:
            print(f"{last_name=} invalid")
            corrupted.append(row)
            continue

        try:
            y, m, d = map(int, date.split("-"))

            if y < 1960 or y > 2004:
                print(f"{y=} invalid")
                corrupted.append(row)
                continue

            if m < 1 or m > 12:
                print(f"{m=} invalid")
                corrupted.append(row)
                continue

            if d < 1 or d > 31:
                print(f"{d=} invalid")
                corrupted.append(row)
                continue
        except ValueError:  # non-numeric parts, or not exactly three parts
            print(f"{date=} invalid")
            corrupted.append(row)
            continue

        if program not in valid_programs:
            print(f"{program=} invalid")
            corrupted.append(row)
            continue

        valid.append(row)

print(f"{valid=}")
print("-" * 80)
print(f"{corrupted=}")

Prints:

program='DS' invalid
program='DS' invalid
valid=[['0944991', 'Vanny', 'Jerromes', '1996-08-10', 'TINF']]
--------------------------------------------------------------------------------
corrupted=[['0893527', 'Ruggiero', 'Fifield', '1976-08-18', 'DS'], ['0959490', 'Abbe', 'Trees', '1986-11-29', 'DS']]
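
If you would rather keep the validate_data(line) skeleton from the question, the same checks can be folded into that function. The following is only a rough sketch under a few assumptions: the helper is_valid_date and the constant names are made up for illustration, each line is split on plain commas (so quoted fields containing commas are not handled), and the date check uses datetime.strptime, which is stricter than the stated "days between 1 and 31" rule because it also rejects impossible dates such as 31 February.

```python
import re
from datetime import datetime

# Patterns and allowed values taken from the rules in the question.
# [0-9] is used instead of \d so that only ASCII digits are accepted.
PAT_NUMBER = re.compile(r"0[89][0-9]{5}")
PAT_NAME = re.compile(r"[A-Za-z]+")
PAT_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
VALID_PROGRAMS = {"INF", "TINF", "CMD", "AI"}


def is_valid_date(text):
    """Hypothetical helper: True for a real date in YYYY-MM-DD form, 1960-2004."""
    if not PAT_DATE.fullmatch(text):
        return False
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:  # e.g. month 13 or 31 February
        return False
    return 1960 <= parsed.year <= 2004


def validate_data(line):
    """Return True if every field of one comma-separated line is valid."""
    fields = line.split(",")  # assumption: no quoted commas inside fields
    if len(fields) != 5:
        return False
    number, first_name, last_name, date, program = fields
    return (
        PAT_NUMBER.fullmatch(number) is not None
        and PAT_NAME.fullmatch(first_name) is not None
        and PAT_NAME.fullmatch(last_name) is not None
        and is_valid_date(date)
        and program in VALID_PROGRAMS
    )


valid_lines, corrupt_lines = [], []
sample = [
    "0893527,Ruggiero,Fifield,1976-08-18,DS",
    "0944991,Vanny,Jerromes,1996-08-10,TINF",
    "0959490,Abbe,Trees,1986-11-29,DS",
]
for line in sample:
    # Append each line to whichever list matches the validation result.
    (valid_lines if validate_data(line) else corrupt_lines).append(line)

print("### VALID LINES ###")
print("\n".join(valid_lines))
print("### CORRUPT LINES ###")
print("\n".join(corrupt_lines))
```

With the three sample rows from the question, only the TINF row ends up in valid_lines, which matches the output above. To run it against the real file, replace the sample list with the stripped lines read from the CSV.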