In Python, how can I read a sheet that has an specific name in a column?-CodePudding

So basically I have a ton of files that change each week that passes by, I want to know if there is a way that I can go ahead and specify to the python script to grab that sheet that contains an specific column name, so for example

For file test.xlsx I have the following structure

sheet 1

columnA	columnB	columnC	Columnd	ColumnE	ColumnF
dsf	sdfas	asdf	sadf	asfdsd	sdfasf
sdfsd	sadfsd	asdfsd	asdfasd	asdfdsf	sdfadf

Sheet 2

jira_tt	alignID	issueID	testID
dsf	sdfas	asdf	sadf

As you can see the excel file has 2 sheets, however this is just an example as some file may have more than 2 sheets or the names of the sheets will change, as stated above I want to read all the sheets in all the files that have the keyword "jira" on their columns, so far I have been able to create a script that reads all the files on the target folder, however I don't have a clue on how to specify the sheet as I needed, here is part of the code that I've created so far

import glob as glob
import pandas as pd

#using glob to get all the files that contains an xlsx extension
ingestor = glob.glob("*.xlsx")

for f in ingestor:
    df = pd.read_excel(f)

df

Any kind of assistance or guidance will be appreciated it.

CodePudding user response：

You should use the sheet_name argument of pandas.read_excel, read here.

For the case that the number of sheets is unknown, specify None to get all worksheets:

pd.read_excel(f, sheet_name=None)

Note that now a dict of DataFrames is returned.

CodePudding user response：

To include all your files as DataFrame's you can create a list to store it and use merge() method to split files in one DataFrame, for example:

import glob as glob
import pandas as pd

ingestor = glob.glob("*.xlsx")

df = reduce(lambda left,right: pd.merge(left,right), [pd.read_excel(data) for data in ingestor])

print(df.isin(['jira']))

if want just files that contains a specific word (like "jira"), you need to evaluate with a conditional using any() method on each iteration and merge data:

ingestor = glob.glob("*.xlsx")

df = pd.DataFrame()
for f in ingestor:
    data = pd.read_excel(f)
    df.insert(data[data.str.contains("jira"))
   
print(df)