Compare multiple CSV files by row and delete files not needed


I am comparing multiple CSV files against a master file on the values of a selected column, and I want to keep only the file that has the most matches with the master file.

The code I have so far gives me the result for each file, but I don't know how to compare the files against each other and keep only the one with the highest match count at the end.

I know how to delete files via os.remove() and so on, but I need help with selecting the maximum value.

import glob
import os

import pandas as pd

data0 = pd.read_csv('input_path/master_file.csv', sep=',')

csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)

for csv_file in csv_files:
    df_base = os.path.basename(csv_file)
    input_dir = os.path.dirname(csv_file)
    data1 = pd.read_csv(csv_file, sep=',')
    # rows that appear in both frames show up as duplicates, i.e. matches on 'values'
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    total = match1.sum()  # avoid shadowing the built-in sum()
    print('Matches between ' + df_base + ' & ' + input_dir + ': ' + str(total))

The print gives me (paths and directory names appear correctly):

Matches between ... & ...: 332215
Matches between ... & ...: 273239

I had the idea to try it via sub-lists, but just did not get anywhere.

CodePudding user response:

You could write a function to calculate the "match score" for each file, and use that function as the key argument for the max function:

def match_score(csv_file):
    # data0 is the master DataFrame you already loaded above
    data1 = pd.read_csv(csv_file, sep=",")
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    return match1.sum()

Then,

csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)
max_match_file = max(csv_files, key=match_score)
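If you then want to delete everything except the winner (you mentioned os.remove()), here is a minimal sketch re-using the csv_files list and match_score function from above; note it reads each file once to compute its score:

import os

best_file = max(csv_files, key=match_score)

for csv_file in csv_files:
    if csv_file != best_file:
        os.remove(csv_file)  # keep only the file with the most matches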

CodePudding user response:

You can simplify your code a lot using pathlib.

Addressing your question, you can store the duplicates sum in a dictionary and, after comparing all files, choose the one with the most matches. Something like this:

import pandas as pd
from pathlib import Path

main_file = Path('/main/path/main.csv')
main_df = pd.read_csv(main_file)

other_path = Path('/other/path/')
other_files = other_path.rglob('*.csv')

matches_per_file = {}

for other_file in other_files:
    other_df = pd.read_csv(other_file)
    # rows duplicated across both frames are matches on the 'values' column
    merged_df = pd.concat([main_df, other_df])[['values']]
    dups = merged_df.loc[merged_df.duplicated()]
    dups_sum = sum(dups.count(axis=1))
    matches_per_file[other_file] = dups_sum
    print(f'Matches between {other_file} and {main_file}: {dups_sum}')

# find the file with most matches
most_matches = max(matches_per_file, key=matches_per_file.get)

The code above populates matches_per_file with filename: matches pairs. That makes it easy to find the maximum match count and the corresponding filename, and then decide which files to keep and which to delete. The variable most_matches will hold that filename.

Use the code snippet as a starting point, since I don't have the data files to test it properly.
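For the deletion step, a minimal sketch along the same lines, assuming you want to keep only most_matches; the keys of matches_per_file are Path objects, and Path.unlink() is the pathlib counterpart of os.remove():

# delete every candidate file except the one with the most matches
for other_file in matches_per_file:
    if other_file != most_matches:
        other_file.unlink()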
