Returning all the column names as lists from multiple Parquet files in Python

I have more than 100 Parquet files in a folder, and I am not sure whether they all have the same feature names (column names). I want to write some Python code, using pandas, that reads every file in the directory and returns the column names with the file name as a prefix.

I tried a for loop, but I am not sure how to structure it. As a beginner, I could not write the looped script myself.

import glob
import pandas as pd

path = r'C:\Users\NewFOlder1\NewFOlder\Folder'
all_files = glob.glob(path + r'\*.gzip')

col=[]
for paths in all_files:
    
    df=pd.read_parquet(paths)
    col.append(df.columns)
    print(col)

CodePudding user response：

IIUC, use pandas.concat with pandas.DataFrame.columns to get the union of the column names across all files:

import glob
import pandas as pd

path = r'C:\Users\NewFOlder1\NewFOlder\Folder' 
all_files = glob.glob(path + r'\*.gzip')

list_dfs = []
for paths in all_files:
    df = pd.read_parquet(paths)
    list_dfs.append(df)
    
col_names = pd.concat(list_dfs).columns.tolist()
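
If you only need the names, you may not have to load the data at all. Here is a minimal sketch that reads just the schema of each file and keeps the column names per file. It assumes pyarrow is installed (pandas commonly uses it for read_parquet); the helper names read_column_names, columns_by_file and prefixed_names are made up for this example.

```python
import glob
import os


def read_column_names(parquet_path):
    """Return the column names of one Parquet file without loading its rows."""
    # Imported lazily so the other helpers work even without pyarrow.
    import pyarrow.parquet as pq
    return pq.read_schema(parquet_path).names


def columns_by_file(folder, pattern="*.gzip", reader=read_column_names):
    """Map each matching file name in `folder` to its list of column names."""
    result = {}
    for file_path in sorted(glob.glob(os.path.join(folder, pattern))):
        result[os.path.basename(file_path)] = list(reader(file_path))
    return result


def prefixed_names(columns):
    """Flatten the mapping into 'filename_column' strings."""
    return [f"{name}_{col}" for name, cols in columns.items() for col in cols]
```

Calling columns_by_file(r'C:\Users\NewFOlder1\NewFOlder\Folder') would give one entry per file, and prefixed_names(...) turns that into a flat list with the file name as prefix.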

CodePudding user response：

Can you try this:

import glob
import pandas as pd
path = r'C:\Users\NewFOlder1\NewFOlder\Folder' 
all_files = glob.glob(path + r'\*.gzip')

col = []
for paths in all_files:
    df = pd.read_parquet(paths)
    # Suffix every column name with the path of the file it came from
    col.append(list(df.columns + '_' + paths))

print(col)

If the filenames look like "abcd.parquet" (if not, please provide a sample filename), you can try something like this to find the columns that do not appear in every file:

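from collections import Counter
# col holds one list per file, entries look like "<column>_<file path>";
# this assumes the column names themselves contain no underscore.
names = [i.split("_", 1)[0] for cols in col for i in cols]
counts = Counter(names)
differences = []
for cols in col:
    for i in cols:
        val = i.split("_", 1)[0]
        # Keep entries whose column name is missing from at least one file
        if counts[val] < len(col):
            differences.append(i)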
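
A variant that avoids splitting strings is to keep the column names in a dict keyed by file name and compare them as sets. This is only a sketch: compare_columns is a hypothetical helper, and the file and column names in the example are made up.

```python
def compare_columns(columns):
    """Compare column names across files.

    `columns` maps a file name to its list of column names.
    Returns (columns shared by every file, {file name: columns not shared by all}).
    """
    column_sets = {name: set(cols) for name, cols in columns.items()}
    # set.intersection needs at least one set, so handle the empty case.
    common = set.intersection(*column_sets.values()) if column_sets else set()
    extras = {
        name: sorted(cols - common)
        for name, cols in column_sets.items()
        if cols - common
    }
    return sorted(common), extras


# Made-up example data, one entry per file
example = {
    "a.gzip": ["id", "x"],
    "b.gzip": ["id", "y"],
    "c.gzip": ["id"],
}
common, extras = compare_columns(example)
print(common)  # ['id']
print(extras)  # {'a.gzip': ['x'], 'b.gzip': ['y']}
```

If extras comes back empty, every file has the same set of column names.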