I am building a personal finance planner. What I want to do is the following:
- Download account statements as CSV files from different bank accounts.
- Loop with Python over each file in the folder and read the data.
- Write the data from each column into an SQL database.
The problem I now have is the following: every CSV file has a slightly different structure (the headers differ).
I would like to know the best approach to process this data.
At the moment I have created a separate function for each bank to produce the same structure. The problem here is that most of the code is identical, and if I want to change anything, I need to make the change in every function.
Is there a way to restructure the data in the CSV so that I can send all files to the same function for processing? The problem I see with this is that I would need to process the same CSV file multiple times (first structure it, then read it).
Here are some examples of what the header structures look like:
Bank 1: ['account_num', 'text', 'date', 'valutadate', 'amount', 'currency']
Bank 2: ['account_num', 'text', 'valutadate', 'currency', 'amount_withdrawal', 'amount_deposit']
I would need them both in the following order: ['account_num', 'text', 'valutadate', 'amount', 'currency', 'category']
And here is the code for both of my functions. They are at a very early stage, but both work so far: they pull the data from the CSV and give it to me in the correct structure.
import csv
from datetime import datetime
import pandas as pd

# function for reading bank1 data
def read_bank(filename):
    # Header for the data in the csv (the file itself has no header row)
    header = ['account_num', 'text', 'date', 'valutadate', 'amount', 'currency']
    # Empty pandas dataframe for transporting the data to the database
    head_row = ['account_num', 'text', 'valutadate', 'amount', 'currency', 'category']
    bank1_df = pd.DataFrame(columns=head_row)
    # Process the data from the file
    with open(filename, 'r') as fhandle:
        # fieldnames adds a name to each column, so it can be accessed by name instead of index
        csv_reader = csv.DictReader(fhandle, delimiter=';', fieldnames=header)
        # header = next(csv_reader)  # Attention: this would consume the first csv row as a header and skip it as data.
        # print(header)
        # Loop over the data and write it into the dataframe
        for row_index, row in enumerate(csv_reader):
            # Safety check that all rows have the same length, otherwise the transformation could fail
            if len(row) != 6:
                print(f"Row {row_index} is only {len(row)} elements long.")
                break
            # Extract the data elements
            account_num = row['account_num']
            text = row['text']
            date = datetime.strptime(row['date'], '%d.%m.%Y')
            valutadate = datetime.strptime(row['valutadate'], '%d.%m.%Y')
            amount = float(row['amount'].replace('.', '').replace(',', '.'))
            currency = row['currency']
            category = match_c.match_category(text)
            # Create a hash value of each record, to add a unique identifier in the table
            # hash_val = hash_func(account_num, text, valutadate, amount)
            # Write the data elements into the dataframe as a new row (.loc[row_index])
            bank1_df.loc[row_index] = [account_num, text, valutadate, amount, currency, category]
            # print(f"hash: {hash_val.hexdigest()}; account_num: {account_num}; text: {text}; date: {date}; valutadate: {valutadate}; amount: {amount}; currency: {currency}")
    print(bank1_df)
    print(f"Total Rows: {row_index + 1}")
# function for reading bank2 data
def read_bank2(filename):
    # Header for the data in the csv
    header = ['account_num', 'text', 'valutadate', 'currency', 'amount_withdrawal', 'amount_deposit']
    # Empty pandas dataframe for transporting the data to the database
    head_row = ['account_num', 'text', 'valutadate', 'amount', 'currency', 'category']
    bank2_df = pd.DataFrame(columns=head_row)
    # Process the data from the file
    with open(filename, 'r') as fhandle:
        # fieldnames adds a name to each column, so it can be accessed by name instead of index
        csv_reader = csv.DictReader(fhandle, delimiter=';', fieldnames=header)
        next(csv_reader)  # Attention: this consumes the first csv row (the header row) so it is not read as data.
        # Loop over the data and write it into the dataframe
        for row_index, row in enumerate(csv_reader):
            # Safety check that all rows have the same length, otherwise the transformation could fail
            if len(row) != 6:
                print(f"Row {row_index} is only {len(row)} elements long.")
                break
            # Extract the data elements
            account_num = row['account_num']
            text = row['text']
            valutadate = datetime.strptime(row['valutadate'], '%d.%m.%Y')
            currency = row['currency']
            amount_withdrawal = float(row['amount_withdrawal'].replace('.', '').replace(',', '.'))
            amount_deposit = float(row['amount_deposit'].replace('.', '').replace(',', '.'))
            amount = amount_deposit - amount_withdrawal
            category = match_c.match_category(text)
            # Create a hash value of each record, to add a unique identifier in the table
            # hash_val = hash_func(account_num, text, valutadate, amount)
            # Write the data elements into the dataframe as a new row (.loc[row_index])
            bank2_df.loc[row_index] = [account_num, text, valutadate, amount, currency, category]
            # print(f"hash: {hash_val.hexdigest()}; account_num: {account_num}; text: {text}; valutadate: {valutadate}; currency: {currency}; amount_withdrawal: {amount_withdrawal}; amount_deposit: {amount_deposit}")
    print(bank2_df)
    print(f"Total Rows: {row_index + 1}")
Thank you for any feedback.
CodePudding user response:
You could create a dictionary that has all the possible names for a given column:
column_names = {
    'date': ['date', 'valutadate'],
    'amount': ['amount', 'amount_withdrawal']
}
Then you can look up a given input csv column name in this dictionary to find the "real" name for that column in the final output csv.
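A minimal sketch of that lookup, assuming the mapping above (the alias lists and the helper names `canonical_name`/`normalize_row` are illustrative, not from the original code):

```python
# Mapping from canonical column names to each bank's possible header names.
# Extend the alias lists per bank as needed.
column_names = {
    'date': ['date', 'valutadate'],
    'amount': ['amount', 'amount_withdrawal'],
}

def canonical_name(raw_name):
    """Return the canonical column name for a raw csv header, or the raw name unchanged."""
    for canonical, aliases in column_names.items():
        if raw_name in aliases:
            return canonical
    return raw_name

def normalize_row(row):
    """Rename the keys of one csv row dict to the canonical names."""
    return {canonical_name(key): value for key, value in row.items()}
```

With this in place, both bank files can be pushed through one processing function: read each row with `csv.DictReader`, call `normalize_row` on it, and only the per-bank parts (e.g. combining withdrawal and deposit into one amount) remain bank-specific.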