I have a Google Cloud Storage bucket with several CSV files. The goal is to compare two columns across every one of the CSV files and print out any values that don't exist in every single CSV file in the bucket.
So far I can list all the files in the bucket, then, in a for loop, build a dictionary with the contents of each CSV file, converting each one to a dataframe.
Now I need to compare the columns between all the files and extract/print out any values that are not common to every CSV file (dataframe). This is the part that has me stumped, as I'm not sure how to do the comparison.
This is basically what I have so far; it prints out all the dataframes that I want to compare (in this case four total dataframes):
import os
import pandas as pd
import numpy as np
from google.cloud import storage
bucketName = 'test_bucket_01'
bucketFolder = 'test_folder_01'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'test.credentials.json'
storage_client = storage.Client.from_service_account_json('test.credentials.json')
bucket = storage_client.get_bucket(bucketName)
file_location = f'gs://{bucketName}/{bucketFolder}'
storage_blob_list = list(storage_client.list_blobs(bucketName, prefix=bucketFolder))
file_list = [blob_name.name[len(bucketFolder):] for blob_name in storage_blob_list if blob_name.name != bucketFolder and 'String' in blob_name.name]
file_list = [element.replace("/","") for element in file_list]
d = {}  # one dataframe per file, keyed by file name
for fileName in file_list:
    data = pd.read_csv(f'{file_location}/{fileName}',
                       usecols=['ColumnA', 'ColumnB'])
    d[fileName] = data
    print(d[fileName])
And here is the output of the four csv files/dataframes:
ColumnA ColumnB
0 AA-1234 AA-1234-ABC
1 AA-1235 AA-1235-ABC
2 AA-1236 AA-1236-ABC
3 AA-1237 AA-1237-ABC
ColumnA ColumnB
0 AA-1234 AA-1234-ABC
1 AA-1235 AA-1235-ABC
2 AA-1236 AA-1236-ABC
3 BB-1237 BB-1237-ABC
ColumnA ColumnB
0 AA-1234 AA-1234-ABC
1 AA-1235 AA-1235-ABC
2 AA-1236 AA-1236-ABC
3 CC-1237 CC-1237-ABC
ColumnA ColumnB
0 AA-1234 AA-1234-ABC
1 AA-1235 AA-1235-ABC
2 AA-1236 AA-1236-ABC
3 DD-1237 DD-1237-ABC
So what's the easiest way for me to print out/extract only the values that don't exist in all the dataframes?
CodePudding user response:
I believe this will work (note that file_list holds bare file names, so the gs:// prefix has to be added back when reading):
df = pd.concat((pd.read_csv(f'{file_location}/{x}', usecols=['ColumnA', 'ColumnB'])
                for x in file_list), ignore_index=True).drop_duplicates(keep=False)
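To illustrate what drop_duplicates(keep=False) does here, this is a minimal sketch that rebuilds the four sample dataframes in memory, so it runs without any bucket access (the frames and combined names exist only for this example), and applies the same concat-and-drop step; only rows that occur exactly once across all files survive:
import pandas as pd

# Rebuild the four example dataframes shown in the question (no bucket access needed).
frames = [
    pd.DataFrame({'ColumnA': ['AA-1234', 'AA-1235', 'AA-1236', last],
                  'ColumnB': ['AA-1234-ABC', 'AA-1235-ABC', 'AA-1236-ABC', f'{last}-ABC']})
    for last in ['AA-1237', 'BB-1237', 'CC-1237', 'DD-1237']
]

# keep=False removes every row that appears more than once, so only the rows
# unique to a single dataframe remain (here: the four *-1237 rows).
combined = pd.concat(frames, ignore_index=True)
print(combined.drop_duplicates(keep=False))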
CodePudding user response:
Does the OP want what is not in every file including things that are in multiple files but not all of them?
fileCount = len(file_list)
df = pd.concat((pd.read_csv(f'{file_location}/{x}', usecols=['ColumnA', 'ColumnB'])
                for x in file_list), ignore_index=True)
counts = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'count'})
print(counts.query('count < @fileCount'))
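If what's wanted is just the individual values (rather than whole rows) that fail to appear in every file, a set-based variant over the dictionary d built in the question's loop is another option. This is only a sketch under that assumption, not the approach above:
# Compare each column across all dataframes using plain Python sets.
for col in ['ColumnA', 'ColumnB']:
    per_file = [set(frame[col]) for frame in d.values()]
    common = set.intersection(*per_file)      # values present in every file
    everything = set.union(*per_file)         # values present in at least one file
    print(col, sorted(everything - common))   # values missing from at least one file
With the four example dataframes this prints the *-1237 values for both columns, since those are the only ones not shared by all files.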