I have more than 100 Parquet files in a folder, and I am not sure whether they all have the same feature names (column names). I want to write some Python code, using pandas, that reads every file in the directory and returns the column names with the file name as a prefix.
I tried a for loop but am not sure how to structure it. As a beginner, I could not get a looped script to work.
import glob
import pandas as pd
path = r'C:\Users\NewFOlder1\NewFOlder\Folder'
all_files = glob.glob(path + r'\*.gzip')
col = []
for paths in all_files:
    df = pd.read_parquet(paths)
    col.append(df.columns)
print(col)
CodePudding user response:
IIUC, use pandas.concat with pandas.DataFrame.columns:
import glob
import pandas as pd
path = r'C:\Users\NewFOlder1\NewFOlder\Folder'
all_files = glob.glob(path + r'\*.gzip')
list_dfs = []
for paths in all_files:
    df = pd.read_parquet(paths)
    list_dfs.append(df)
# Columns of the concatenated frame: the union of the columns of all files
col_names = pd.concat(list_dfs).columns.tolist()
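If the goal is to keep track of which file each column came from, a minimal sketch could collect the names per file instead of concatenating everything. The helper names `prefixed_columns` and `columns_by_file` and the `:` separator are made up for illustration; the `*.gzip` pattern is taken from the question.

```python
import glob
import os


def prefixed_columns(file_name, columns):
    """Return the column names with the file name as prefix, e.g. 'a.gzip:price'."""
    # ':' is an arbitrary separator chosen for this sketch
    return [f"{file_name}:{column}" for column in columns]


def columns_by_file(folder, pattern="*.gzip"):
    """Map each matching file name to its prefixed column names."""
    # Imported here so prefixed_columns can be used without pandas installed;
    # read_parquet also needs a Parquet engine such as pyarrow.
    import pandas as pd

    result = {}
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        file_name = os.path.basename(path)
        df = pd.read_parquet(path)
        result[file_name] = prefixed_columns(file_name, df.columns)
    return result
```

Calling `columns_by_file(r'C:\Users\NewFOlder1\NewFOlder\Folder')` should then give one list of prefixed names per file, which can be printed or compared afterwards.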
CodePudding user response:
Can you try this:
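import glob
import pandas as pd
path = r'C:\Users\NewFOlder1\NewFOlder\Folder'
all_files = glob.glob(path + r'\*.gzip')
col = []
for paths in all_files:
    df = pd.read_parquet(paths)
    # One "<column>_<path>" string per column, kept in a flat list
    col.extend(list(df.columns + '_' + paths))
print(col)
If the file names look like "abcd.parquet" (if not, please provide a sample file name), you can try something like this to find the differences. It assumes that the column names themselves contain no underscore:
from collections import Counter
# Column part of each entry, i.e. everything before the first underscore
replaced_cols = [i.split("_", 1)[0] for i in col]
counts = Counter(replaced_cols)
differences = []
for i in col:
    val = i.split("_", 1)[0]
    # Keep entries whose column does not appear in every file
    if counts[val] < len(all_files):
        differences.append(i)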
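A set-based comparison may be easier to read and does not depend on splitting strings. The sketch below assumes the column names have already been collected into a dictionary keyed by file name; the function name `column_differences` and the example data are hypothetical.

```python
def column_differences(file_columns):
    """For each file, list the columns it lacks relative to the union of all files."""
    # Union of every column name seen in any file
    all_columns = set()
    for columns in file_columns.values():
        all_columns.update(columns)

    missing = {}
    for file_name, columns in file_columns.items():
        absent = sorted(all_columns - set(columns))
        if absent:  # only report files that actually lack something
            missing[file_name] = absent
    return missing


# Made-up example data standing in for the real files
example = {
    "a.gzip": ["id", "price", "qty"],
    "b.gzip": ["id", "price"],
    "c.gzip": ["id", "price", "qty"],
}
print(column_differences(example))  # {'b.gzip': ['qty']}
```

An empty dictionary would mean that every file has the same set of columns.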