Avoiding a nested for loop when reading files in pandas for comparison

Time:04-14

I have a dictionary called "file_dic" with the structure {key: file_path}. I want to read each file path into a pandas DataFrame, grab its columns, and check whether they exist in the other files in the dictionary. My solution works, but I want to avoid a nested for loop. What would be the best way to do this? I'm trying to learn to write better code.

import pandas as pd

file_diff = {}
for i in file_dic.keys():
    temp_col1 = pd.read_csv(file_dic[i], nrows=1).columns.tolist()
    for j in file_dic.keys():
        if j != i:
            temp_col2 = pd.read_csv(file_dic[j], nrows=1).columns.tolist()
            # Columns of file i that are missing from file j
            diff_cols = sorted(set(temp_col1).difference(temp_col2))
            file_diff[str(i) + ' columns not in ' + str(j)] = diff_cols
df = pd.DataFrame.from_dict(file_diff, orient='index').T

CodePudding user response:

As per the comments, your second loop isn't necessary. You can use a count variable to check whether you are on the first key (the first file), and a previous variable to keep track of the columns read on the previous iteration:

import pandas as pd

file_diff = {}
count = 0
for i in file_dic.keys():
    if count == 0:  # first file: nothing to compare against yet
        previous = pd.read_csv(file_dic[i], nrows=1).columns.tolist()
        previous_key = i
    else:
        temp_col2 = pd.read_csv(file_dic[i], nrows=1).columns.tolist()
        # Columns of the previous file that are missing from the current one
        diff_cols = sorted(set(previous).difference(temp_col2))
        file_diff[str(previous_key) + ' columns not in ' + str(i)] = diff_cols
        previous = temp_col2
        previous_key = i
    count += 1
df = pd.DataFrame.from_dict(file_diff, orient='index').T

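This way, previous stores the columns of the file read on the previous iteration, and they are compared with the columns of the file just read (temp_col2). Note that this compares each file only with the one read immediately before it, and only in one direction, rather than every pair of files as the nested loop does.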
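If every ordered pair of files needs to be compared, as in the question's nested loop, one option is to read each header only once and let itertools.permutations generate the pairs. This is a minimal sketch rather than a definitive solution. The helper names read_columns and column_diffs are hypothetical, and it assumes the files are plain CSVs whose first row is the header (nrows=0 asks pandas to parse the header only).

```python
from itertools import permutations

import pandas as pd


def read_columns(path):
    """Return the header of a CSV file as a set of column names."""
    # nrows=0 parses only the header row, so no data rows are loaded.
    return set(pd.read_csv(path, nrows=0).columns)


def column_diffs(file_dic):
    """Map 'A columns not in B' to the sorted columns of A missing from B."""
    # Read every header exactly once instead of once per comparison.
    columns = {key: read_columns(path) for key, path in file_dic.items()}
    # permutations(..., 2) yields every ordered pair (i, j) with i != j.
    return {
        f"{i} columns not in {j}": sorted(columns[i] - columns[j])
        for i, j in permutations(columns, 2)
    }
```

The result can be turned into a DataFrame the same way as before, with pd.DataFrame.from_dict(column_diffs(file_dic), orient='index').T. The number of comparisons is still quadratic in the number of files, but each file is opened once instead of once per pair.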
