So basically I have a ton of files that change each week that passes by, I want to know if there is a way that I can go ahead and specify to the python script to grab that sheet that contains an specific column name, so for example
- For file test.xlsx I have the following structure
sheet 1
columnA | columnB | columnC | Columnd | ColumnE | ColumnF |
---|---|---|---|---|---|
dsf | sdfas | asdf | sadf | asfdsd | sdfasf |
sdfsd | sadfsd | asdfsd | asdfasd | asdfdsf | sdfadf |
Sheet 2
jira_tt | alignID | issueID | testID |
---|---|---|---|
dsf | sdfas | asdf | sadf |
As you can see the excel file has 2 sheets, however this is just an example as some file may have more than 2 sheets or the names of the sheets will change, as stated above I want to read all the sheets in all the files that have the keyword "jira" on their columns, so far I have been able to create a script that reads all the files on the target folder, however I don't have a clue on how to specify the sheet as I needed, here is part of the code that I've created so far
import glob as glob
import pandas as pd
#using glob to get all the files that contains an xlsx extension
ingestor = glob.glob("*.xlsx")
for f in ingestor:
df = pd.read_excel(f)
df
Any kind of assistance or guidance will be appreciated it.
CodePudding user response:
You should use the sheet_name
argument of pandas.read_excel
, read here.
For the case that the number of sheets is unknown, specify None
to get all worksheets:
pd.read_excel(f, sheet_name=None)
Note that now a dict of DataFrames is returned.
CodePudding user response:
To include all your files as DataFrame's
you can create a list to store it and use merge()
method to split files in one DataFrame
, for example:
import glob as glob
import pandas as pd
ingestor = glob.glob("*.xlsx")
df = reduce(lambda left,right: pd.merge(left,right), [pd.read_excel(data) for data in ingestor])
print(df.isin(['jira']))
if want just files that contains a specific word (like "jira"
), you need to evaluate with a conditional using any()
method on each iteration and merge data:
ingestor = glob.glob("*.xlsx")
df = pd.DataFrame()
for f in ingestor:
data = pd.read_excel(f)
df.insert(data[data.str.contains("jira"))
print(df)