I am comparing multiple CSV files against a master file by the values of a selected column, and I want to keep only the file that has the most matches with the master file. The code I created gives me the results for each file, but I don't know how to compare the files against each other and keep only the one with the highest match sum at the end.
I know how to delete files via os.remove() and so on, but I need help with selecting the maximum value.
import glob
import os

import pandas as pd

data0 = pd.read_csv('input_path/master_file.csv', sep=',')
csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)

for df in csv_files:
    df_base = os.path.basename(df)
    input_dir = os.path.dirname(df)
    data1 = pd.read_csv(df, sep=',')
    # rows of the 'values' column that appear in both files show up as duplicates
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    total = match1.sum()  # named 'total' so the built-in sum() is not shadowed
    print(f'Matches between {df_base} & {input_dir}: {total}')
The print gives me (the paths and directory names appear correct):
Matches between ... & ...: 332215
Matches between ... & ...: 273239
I had the idea to try it via sub-lists, but just did not get anywhere.
CodePudding user response:
You could write a function to calculate the "match score" for each file, and use that function as the key argument for the max function:
def match_score(csv_file):
    """Count how many 'values' entries in csv_file also appear in data0."""
    data1 = pd.read_csv(csv_file, sep=",")
    # rows duplicated across the two frames are the matches
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    return cnt1.count(axis=1).sum()
Then,
csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)
# pick the file whose match score is highest
max_match_file = max(csv_files, key=match_score)
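If you then want to keep only that file, a minimal follow-up sketch (untested, reusing the os.remove() approach you already mentioned) could delete every other file:

import os

for csv_file in csv_files:
    if csv_file != max_match_file:
        os.remove(csv_file)  # remove everything except the best match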
CodePudding user response:
You can simplify your code a lot using pathlib.
Addressing your question, you can store the duplicates sum in a dictionary and, after comparing all files, choose the one with the most matches. Something like this:
import pandas as pd
from pathlib import Path

main_file = Path('/main/path/main.csv')
main_df = pd.read_csv(main_file)

other_path = Path('/other/path/')
other_files = other_path.rglob('*.csv')

matches_per_file = {}
for other_file in other_files:
    other_df = pd.read_csv(other_file)
    # duplicated rows in the concatenated 'values' column are the matches
    merged_df = pd.concat([main_df, other_df])[['values']]
    dups = merged_df.loc[merged_df.duplicated()]
    dups_sum = sum(dups.count(axis=1))
    matches_per_file[other_file] = dups_sum
    print(f'Matches between {other_file} and {main_file}: {dups_sum}')

# find the file with most matches
most_matches = max(matches_per_file, key=matches_per_file.get)
The code above will populate matches_per_file with filename: matches pairs. That will make it easy for you to find the maximum match count and the corresponding filename, and then decide which files you will keep and which ones you will delete. The variable most_matches will be set to that filename.
Use the code snippet as a starting point, since I don't have the data files to test it properly.
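For the deletion step itself, a small sketch (also untested, assuming most_matches and matches_per_file from above) could use Path.unlink(), pathlib's way of removing a file:

# keep the file with the most matches, delete the rest
for other_file in matches_per_file:
    if other_file != most_matches:
        other_file.unlink()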