Extract File names from directory and classify on the basis of its extension in excel

I am trying to classify based on file extension from a local directory to excel sheet.

Like my input should be: Directory path

My output should be:

excel sheet with different sheets based on extension. Like if the input directory is having 5 .sh files, 8 .py file and so on. On the basis of extension, sheets should be created with file names.

I am able to achieve the same but it is a bit hard coded.

Any help would be appreciated if it can be automated with hard code:

Below is the code i tried and its working fine:

import glob
import pandas as pd
path = r'<path_name>' #base path
files = glob.glob(path   '/**/*.*', recursive=True)

hql, hive, ksh, sh, csv, txt, sql,py = ([] for i in range(8))

for fpath in files:
 chk_file = fpath.split('\\')
 for file_name in chk_file:
  if '.hql' in file_name:
   print("Hql:",file_name)
   comb = f'{file_name}'
   hql.append(comb)
  if '.hive' in file_name:
   print(file_name)
   comb = f'{file_name}'
   hive.append(comb)
  if '.ksh' in file_name:
   print(file_name)
   comb = f'{file_name}'
   ksh.append(comb)
  if '.sh' in file_name:
   print(file_name)
   comb = f'{file_name}'
   sh.append(comb)
  if '.sql' in file_name:
   print(file_name)
   comb = f'{file_name}'
   sql.append(comb)
  if '.txt' in file_name:
   print(file_name)
   comb = f'{file_name}'
   txt.append(comb)
  if '.csv' in file_name:
   print(file_name)
   comb = f'{file_name}'
   csv.append(comb)
  if '.py' in file_name:
   print(file_name)
   comb = f'{file_name}'
   py.append(comb)
writer = pd.ExcelWriter(r'C:\Users\saurabh.arun.kumar\OneDrive - Accenture\Desktop\outfile2.xlsx',
      engine='xlsxwriter')

new_hql = pd.DataFrame(hql,columns=['file'])
new_hive = pd.DataFrame(hive,columns=['file'])
new_sql = pd.DataFrame(sql,columns=['file'])
new_ksh = pd.DataFrame(ksh,columns=['file'])
new_txt = pd.DataFrame(txt,columns=['file'])
new_sh = pd.DataFrame(sh,columns=['file'])
new_csv = pd.DataFrame(csv,columns=['file'])
new_py = pd.DataFrame(py,columns=['file'])

new_hql.to_excel(writer, sheet_name='hql', index=False)
new_hive.to_excel(writer, sheet_name='hive', index=False)
new_sql.to_excel(writer, sheet_name='sql', index=False)
new_ksh.to_excel(writer, sheet_name='ksh', index=False)
new_csv.to_excel(writer, sheet_name='csv', index=False)
new_txt.to_excel(writer, sheet_name='txt', index=False)
new_sh.to_excel(writer, sheet_name='sh', index=False)
new_py.to_excel(writer, sheet_name='py', index=False)

writer.save()
writer.close()
print ("Executed")

This code will work with the extension provided in the code. And i want it should classify by its own reading the extension and created new sheets with the file names.

Hope i am able to explain the scenario.

CodePudding user response：

You can split the extension from a files path by using

fname, fext = os.path.splitext("/what/ever/kind/of/file/this.is.txt")

Use that to create a dict of "ext" -> "list of files". Use the dict to create n dataframes. Write them to excel.

If you only want certain extensions, filter the dict-keys to those you want:

import glob
import pandas as pd
from os import path

p = r'/redacted/location' # fix this to your path
files = glob.glob(p   '/**/*.*', recursive=True)

d = {}
i = 0    # used to redact my file names - you would simply store fn fex
for f in files:
    fn, fex = path.splitext(f)
    # filter for extensions you want
    if (fex in (".txt",".xlsx", ".docx") ):
        # use  d.setdefault(fex,[]).append(f) - I use something 
        #    to blank out my file names here
        # use collections.defaultdict to get a speed kick if needed
        d.setdefault(fex,[]).append(f"file...{i}{fex}")
        i  = 1

# create single data frames per file extension from dictionary
dfs = []
for key,value in d.items():
    df = pd.DataFrame({key:value})
    dfs.append(df)

# do your excel writing here - use column header for sheet name etc.
for df in dfs:
    print (df)

Output (files/names redacted):

            .docx
0    file...0.docx
1    file...2.docx
2    file...3.docx
3    file...4.docx
4    file...5.docx
5    file...6.docx
6    file...7.docx
7   file...12.docx
8   file...13.docx
9   file...14.docx
10  file...15.docx
11  file...16.docx

            .xlsx
0   file...1.xlsx
1   file...8.xlsx
2   file...9.xlsx
3  file...10.xlsx
4  file...11.xlsx
5  file...17.xlsx

You can then use the column header of each single DF to write your excel sheet - something akin to:

with  pd.ExcelWriter('C:/temp/outfile2.xlsx') as writer:
    for df in dfs:
        df.to_excel(writer, sheet_name = df.columns[0])

should do it - can't test that right now.