I have a list of files to be imported into the data frame
cdoe:
# list contains the dataset name followed by the column name to match all the datasets; this list keeps changing and even the file formats. These dataset file names are provided by the user, and they are unique.
# First: find the file extension format and select appropriate pd.read_ to import
# second: merge the dataframes on the index
# in the below list,
file_list = ['dataset1.csv','datetime','dataset2.xlsx','timestamp']
df = pd.DataFrame()
for i in range(0:2:len(file_list)):
# find the file type first
# presently, I don't know how to find the file type; so
file_type = 'csv'
# second: merge the dataframe into the existing dataframe on the index
tdf = pd.DataFrame()
if file_type == 'csv':
tdf = pd.read_csv('%s'%(file_list[i])))
if file_type == 'xlsx':
tdf = pd.read_excel('%s'%(file_list[i])))
tdf.set_index('%s'%(file_list[i 1]),inplace=True)
# Merge dataframe with the existing dataframe
df = df.merge(tdf,right_index=True,left_index=True)
I reached this far. Is any direct module available to find the file type? I found magic
but it has issues while importing it. Also, suggest a better approach to merge the files?
Update: Working solution
Inspired from the @ljdyer answer below, I came with the following and this is working perfectly:
def find_file_type_import(file_name):
# Total file extensions possible for importing data
file_type = {'csv':'pd.read_csv(file_name)',
'xlsx':'pd.read_excel(file_name)',
'txt':'pd.read_csv(file_name)',
'parquet':'pd.read_parquet(file_name)',
'json':'pd.read_json(file_name)'
}
df = [eval(val) for key,val in file_type.items() if file_name
.endswith(key)][0]
return df
df = find_file_type_import(file_list [0])
This is working perfectly. Thank you for your valuable suggestions. ALso, correct me with the use of eval
is good one or not?
CodePudding user response:
The file type is just the three or four letters at the end of the file name, so the simplest way to do this would just be:
if file_list[i].endswith('csv'):
etc.
Other commons options would be os.path.splitext
or the suffix
attribute of a Path
object from the built-in os
and pathlib
libraries respectively.
The way you are merging looks fine, but I'm not sure why you are using percent notation for the parameters to read_
, set_index
, etc. The elements of your list are just strings anyway, so for example
tdf = pd.read_csv('%s'%(file_list[i])))
could just be:
tdf = pd.read_csv(file_list[i])
CodePudding user response:
You can use glob (or even just os) to retrieve the list of files from a part of their name. Since you guarantee the uniqueness of the file irrespective of the extension, it will only be one (otherwise just put a loop that iterates over the retrieved elements).
Once you have the full file name (which clearly has the extension), just do a split() taking the last element obtained that corresponds to the file extension.
Then, you can read the dataframe with the appropriate function.
Here is an example of code:
from glob import glob
file_list = [
'dataset0', # corresponds to dataset0.csv
'dataset1', # corresponds to dataset1.xlsx
'dataset2.a'
]
for file in file_list:
files_with_curr_name = glob(f'*{file}*')
if len(files_with_curr_name) > 0:
full_file_name = files_with_curr_name[0] # take the first element, the uniqueness of the file name being guaranteed
# extract the file extension (string after the dot, so the last element of split)
file_type = full_file_name.split(".")[-1]
if file_type == 'csv':
print(f'Read {full_file_name} as csv')
# df = pd.read_csv(full_file_name)
elif file_type == 'xlsx':
print(f'Read {full_file_name} as xlsx')
else:
print(f"Don't read {full_file_name}")
Output will be:
Read dataset0.csv as csv
Read dataset1.xlsx as xlsx
Don't read dataset2.a
CodePudding user response:
Using pathlib and a switch dict to call functions.
from pathlib import Path
import pandas as pd
def main(file: Path) -> None:
caller = {
".csv": read_csv,
".xlsx": read_excel,
".pkl": read_pickle
}
print(caller.get(file.suffix)(file))
def read_csv(file: Path) -> pd.DataFrame:
return pd.read_csv(file)
def read_excel(file: Path) -> pd.DataFrame:
return pd.read_excel(file)
def read_pickle(file: Path) -> pd.DataFrame:
return pd.read_pickle(file)
if __name__ == "__main__":
files = [x for x in Path(f"{Path().home()}/Documents/").glob("*") if x.suffix in [".csv", ".xlsx", ".pkl"]]
[main(x) for x in files]