I am trying to classify based on file extension from a local directory to excel sheet.
Like my input should be: Directory path
My output should be:
excel sheet with different sheets based on extension. Like if the input directory is having 5 .sh files, 8 .py file and so on. On the basis of extension, sheets should be created with file names.
I am able to achieve the same but it is a bit hard coded.
Any help would be appreciated if it can be automated with hard code:
Below is the code i tried and its working fine:
import glob
import pandas as pd
path = r'<path_name>' #base path
files = glob.glob(path '/**/*.*', recursive=True)
hql, hive, ksh, sh, csv, txt, sql,py = ([] for i in range(8))
for fpath in files:
chk_file = fpath.split('\\')
for file_name in chk_file:
if '.hql' in file_name:
print("Hql:",file_name)
comb = f'{file_name}'
hql.append(comb)
if '.hive' in file_name:
print(file_name)
comb = f'{file_name}'
hive.append(comb)
if '.ksh' in file_name:
print(file_name)
comb = f'{file_name}'
ksh.append(comb)
if '.sh' in file_name:
print(file_name)
comb = f'{file_name}'
sh.append(comb)
if '.sql' in file_name:
print(file_name)
comb = f'{file_name}'
sql.append(comb)
if '.txt' in file_name:
print(file_name)
comb = f'{file_name}'
txt.append(comb)
if '.csv' in file_name:
print(file_name)
comb = f'{file_name}'
csv.append(comb)
if '.py' in file_name:
print(file_name)
comb = f'{file_name}'
py.append(comb)
writer = pd.ExcelWriter(r'C:\Users\saurabh.arun.kumar\OneDrive - Accenture\Desktop\outfile2.xlsx',
engine='xlsxwriter')
new_hql = pd.DataFrame(hql,columns=['file'])
new_hive = pd.DataFrame(hive,columns=['file'])
new_sql = pd.DataFrame(sql,columns=['file'])
new_ksh = pd.DataFrame(ksh,columns=['file'])
new_txt = pd.DataFrame(txt,columns=['file'])
new_sh = pd.DataFrame(sh,columns=['file'])
new_csv = pd.DataFrame(csv,columns=['file'])
new_py = pd.DataFrame(py,columns=['file'])
new_hql.to_excel(writer, sheet_name='hql', index=False)
new_hive.to_excel(writer, sheet_name='hive', index=False)
new_sql.to_excel(writer, sheet_name='sql', index=False)
new_ksh.to_excel(writer, sheet_name='ksh', index=False)
new_csv.to_excel(writer, sheet_name='csv', index=False)
new_txt.to_excel(writer, sheet_name='txt', index=False)
new_sh.to_excel(writer, sheet_name='sh', index=False)
new_py.to_excel(writer, sheet_name='py', index=False)
writer.save()
writer.close()
print ("Executed")
This code will work with the extension provided in the code. And i want it should classify by its own reading the extension and created new sheets with the file names.
Hope i am able to explain the scenario.
CodePudding user response:
You can split the extension from a files path by using
fname, fext = os.path.splitext("/what/ever/kind/of/file/this.is.txt")
Use that to create a dict of "ext" -> "list of files". Use the dict to create n dataframes. Write them to excel.
If you only want certain extensions, filter the dict-keys to those you want:
import glob
import pandas as pd
from os import path
p = r'/redacted/location' # fix this to your path
files = glob.glob(p '/**/*.*', recursive=True)
d = {}
i = 0 # used to redact my file names - you would simply store fn fex
for f in files:
fn, fex = path.splitext(f)
# filter for extensions you want
if (fex in (".txt",".xlsx", ".docx") ):
# use d.setdefault(fex,[]).append(f) - I use something
# to blank out my file names here
# use collections.defaultdict to get a speed kick if needed
d.setdefault(fex,[]).append(f"file...{i}{fex}")
i = 1
# create single data frames per file extension from dictionary
dfs = []
for key,value in d.items():
df = pd.DataFrame({key:value})
dfs.append(df)
# do your excel writing here - use column header for sheet name etc.
for df in dfs:
print (df)
Output (files/names redacted):
.docx
0 file...0.docx
1 file...2.docx
2 file...3.docx
3 file...4.docx
4 file...5.docx
5 file...6.docx
6 file...7.docx
7 file...12.docx
8 file...13.docx
9 file...14.docx
10 file...15.docx
11 file...16.docx
.xlsx
0 file...1.xlsx
1 file...8.xlsx
2 file...9.xlsx
3 file...10.xlsx
4 file...11.xlsx
5 file...17.xlsx
You can then use the column header of each single DF to write your excel sheet - something akin to:
with pd.ExcelWriter('C:/temp/outfile2.xlsx') as writer:
for df in dfs:
df.to_excel(writer, sheet_name = df.columns[0])
should do it - can't test that right now.