I'm very new to Python, and I'm pretty sure this is an easy thing to code, but I'm having issues.
I have a directory on an FTP server with about 1,000 zip archives, each containing a CSV. The issue is that columns have been added to the CSVs over time. All I want is a count of the columns in each CSV; having that info will let me run the correct SSIS package.
I think counting just the first row would be OK; otherwise it would mean taking the count for each row in the CSV and averaging them. (The data doesn't wrap strings in "".)
Any help would be great.
CodePudding user response:
You could start by identifying all the zip files, then open them one by one and load the csv files inside each of them. Once a csv is loaded you can count its columns and move on to the next file/zip:
import glob
import zipfile

import pandas as pd

directoryPath = "./"

# Go through all zip files in the directory (downloaded from the FTP site)
for zip_file_name in glob.glob(directoryPath + "*.zip"):
    with zipfile.ZipFile(zip_file_name) as zf:
        # List the files available inside this archive
        for csv_file in zf.namelist():
            # Load the csv and count its columns
            df = pd.read_csv(zf.open(csv_file))
            num_columns = len(df.columns)
            print(csv_file, "- number of columns:", num_columns)
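The snippet above assumes the zips are already on local disk and loads each CSV fully into a DataFrame. Since you only need the column count from the first row, and the files live on an FTP server, here is a sketch that pulls each zip into memory with `ftplib` and reads only the header line with the stdlib `csv` module. The host, credentials, and directory arguments are placeholders you would fill in; treat this as a starting point, not a tested solution for your server.

```python
import csv
import io
import zipfile
from ftplib import FTP


def header_column_count(zip_bytes: bytes, member: str) -> int:
    """Count columns using only the first (header) row of one csv inside a zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member) as fh:
            # TextIOWrapper decodes the binary stream so csv.reader can parse it
            first_row = next(csv.reader(io.TextIOWrapper(fh, encoding="utf-8")))
    return len(first_row)


def count_columns_on_ftp(host: str, user: str, password: str, directory: str) -> dict:
    """Download each zip in an FTP directory into memory and map
    (zip name, csv name) -> column count. Connection details are placeholders."""
    counts = {}
    with FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(directory)
        for name in ftp.nlst():
            if not name.lower().endswith(".zip"):
                continue
            buf = io.BytesIO()
            ftp.retrbinary("RETR " + name, buf.write)  # fetch the whole zip into memory
            with zipfile.ZipFile(buf) as zf:
                for member in zf.namelist():
                    if member.lower().endswith(".csv"):
                        counts[(name, member)] = header_column_count(buf.getvalue(), member)
    return counts
```

Because your data doesn't quote strings, counting the header row this way is safe (no embedded commas to worry about), and it avoids parsing ~1k full CSVs just to learn their widths. You could then branch on the count to pick the right SSIS package.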