I have a problem which might be easy for majority of people here.
I have four folders: SA1, SA2, SA3, SA4. Each folder has around 60 csv files. I have defined the path like this:
my_string = "{0}/folder1/{1}/{1} something/{2}/AUST"
analytics_path = "C:/Users/...../SharePoint - Documents/folder2"
year = "2016" # User should define this
level = "SA3" # User should define this
path = my_string.format(analytics_path, year, level)
Once the user defines the year and level in the path above, I to combine all the csv files under the "level" folder based on the "index_col=" parameter.
For exmaple, for SA1, I want to combine the CSV files based on the "SA1_code" column. For SA2, I want to combine the CSV files based on the "SA2_MAIN_DIGIT_CODE" column. For SA3 and SA4, the index_col should be "SA3_MULTI" and "SA4_REGIONS" respectively. As you can see all the columns names for CSV files under these four folders are different.
So far, I have attempted the following things. I have defined the function as
def combine_csv(path):
"""
Concatenates the csv files and create the huge dataframe combing the information in all the csv files in a given folder
Parameters
----------
Path (string); Location of the folder and files therein
Returns
----------
A dataframe of all the concatenated csv files
"""
# setting the path for joining multiple files
files = os.path.join(path "/*.csv")
# list of merged files returned
files = glob.glob(files)
# joining files with concat and read_csv
list_csv = []
for filename in files:
list_df = pd.read_csv(filename) # Can't give the "index_col" as there four different strings for teh csv files in each folder
list_csv.append(list_df)
df = pd.concat(list_csv, axis=1, ignore_index=False)
return df
data_df = combine_csv(path)
gives me the combined dataframe. But I want to combine it based on "SA1_code" if the user chooses to go to SA1 folder or "SA2_MAIN_DIGIT_CODE" if they choose to combine CSV files from SA2 folder, and so on and so forth.
How do I do this?
CodePudding user response:
You don't have four separate index columns, you just have one that changes depending on user input. Therefore, the solution to your problem is relatively simple. First, modify your combine_csv
method:
def combine_csv(path, index):
"""
Concatenates the csv files and create the huge dataframe combing the information in all the csv files in a given folder
Parameters
----------
Path (string); Location of the folder and files therein
Returns
----------
A dataframe of all the concatenated csv files
"""
# setting the path for joining multiple files
files = os.path.join(path "/*.csv")
# list of merged files returned
files = glob.glob(files)
# joining files with concat and read_csv
list_csv = []
for filename in files:
list_df = pd.read_csv(filename, index_col = index)
list_csv.append(list_df)
df = pd.concat(list_csv, axis=1, ignore_index=False)
return df
All I did was inject a value, index
, that will be used for the indeX_col
argument to read_csv
.
Next, we need to determine the value for index
based on the value for level
, as input by the user. According to your question, it seems that there should be a one-to-one relationship between these values. So, we can use a dictionary for this:
LevelIndexMapping = {
"SA1": "SA1_code",
"SA2": "SA2_MAIN_DIGIT_CODE",
"SA3": "SA3_MULTI",
"SA4": "SA4_REGIONS"
}
my_string = "{0}/folder1/{1}/{1} something/{2}/AUST"
analytics_path = "C:/Users/...../SharePoint - Documents/folder2"
year = "2016" # User should define this
level = "SA3" # User should define this
path = my_string.format(analytics_path, year, level)
combine_csv(path, LevelIndexMapping[level])
Here, I created a dictionary that maps your level variable to its associated index column value, and then accesses that mapping when calling combine_csv
.