Home > database >  Comparing multiple/more than 2 dataframes and extracting the values that aren't common to all d
Comparing multiple/more than 2 dataframes and extracting the values that aren't common to all d

Time:10-13

So basically I have a Google Cloud bucket with several csv files, the goal is to compare two columns in every one of the csv files and print out any values that don't exist in every single csv file in the bucket.

So far I can list all the files in the bucket and then using a for loop create a dictionary with the contents of each csv file and then converts each to a dataframe.

Now, I need to compare the columns between all the files, and extract/print out any values that are not common to every csv file (dataframe). This is the part that has me stumped as I'm not sure how to do the compare.

This is basically what I have so far, it prints out all the dataframes that I want to compare (in this case four total dataframes):

import pandas as pd, numpy as np
import os
from google.cloud import storage

bucketName = 'test_bucket_01'
bucketFolder = 'test_folder_01'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'test.credentials.json'
storage_client = storage.Client.from_service_account_json('test.credentials.json')
bucket = storage_client.get_bucket(bucketName)
file_location = f'gs://{bucketName}/{bucketFolder}'

storage_blob_list = list(storage_client.list_blobs(bucketName, prefix=bucketFolder))

file_list = [blob_name.name[len(bucketFolder):] for blob_name in storage_blob_list if blob_name.name != bucketFolder and 'String' in blob_name.name]
file_list = [element.replace("/","") for element in file_list]

for fileName in file_list:
    d = {}
    data = pd.read_csv(f'{file_location}/{fileName}',
            usecols=['ColumnA','ColumnB'])
    d[fileName1] = data
    print(d[fileName])

And here is the output of the four csv files/dataframes:

   ColumnA      ColumnB
0  AA-1234  AA-1234-ABC
1  AA-1235  AA-1235-ABC
2  AA-1236  AA-1236-ABC
3  AA-1237  AA-1237-ABC
   ColumnA      ColumnB
0  AA-1234  AA-1234-ABC
1  AA-1235  AA-1235-ABC
2  AA-1236  AA-1236-ABC
3  BB-1237  BB-1237-ABC
   ColumnA      ColumnB
0  AA-1234  AA-1234-ABC
1  AA-1235  AA-1235-ABC
2  AA-1236  AA-1236-ABC
3  CC-1237  CC-1237-ABC
   ColumnA      ColumnB
0  AA-1234  AA-1234-ABC
1  AA-1235  AA-1235-ABC
2  AA-1236  AA-1236-ABC
3  DD-1237  DD-1237-ABC

So what's the easiest way for me to print out/extract only the values that don't exist in all the dataframes?

CodePudding user response:

I believe this will work

df = pd.concat((pd.read_csv(x) for x in file_list), ignore_index=True).drop_duplicates(keep=False)

CodePudding user response:

Does the OP want what is not in every file including things that are in multiple files but not all of them?

fileCount=len(file_list)
df = pd.concat((pd.read_csv(x) for x in file_list), ignore_index=True)
df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'}).query('count < @fileCount')
  • Related