I want to preface this that I am VERY new to Python so I am very much learning as I go.
Project:
Scrape data from an HTML table, clean the data up and append to a copied template in Excel for a shipping report.
Current Code:
In my first step I get the data via:
import xlwings as xw
import pandas as pd
import openpyxl
#Get order number for API call and create dataframe
wb = xw.books.active
sheet = wb.sheets['Variables']
OrderNum = str(sheet.range('B2').value)
Filepath = sheet.range('B1').value
OrderNum = OrderNum[:-2]
url = "https://extrnlnet.se.com/OrderStatusReport/ORDERPRINT.aspx?Print=EbaseOrderInquiry&OrderNumber={}&PurchID=21510&UserID=21510".format(OrderNum)
df = pd.read_html(url)[0]
I then create a sheet to dump the data into, copy my template and rename it: `
if OrderNum not in wb.sheet_names:
wb.sheets.add(OrderNum)
#create sheet named the same as the order number and paste dataframe if sheet does not already exist
if '{} Report'.format(OrderNum) not in wb.sheet_names:
ws = wb.sheets[OrderNum]
ws.range('A1').options(index=False).value = df
#copy report template to begin processing
ws = wb.sheets['ReportTemplate']
ws.copy(name="{} Report".format(OrderNum))
#set worksheet to created page
ws = wb.sheets['{} Report'.format(OrderNum)]
#set secondary worksheet to pasted dataframe
ws1 = wb.sheets['{}'.format(OrderNum)]
` Here is a snip of the data that gets dumped (cut a couple columns to reduce pictures width):
Column G is my "control" column. If the value <>
"cancelled" I want to append specific columns to my empty copy of the template I just created.
Problem: How would I properly loop through column "G" and append the rows specific columns when the value of column "G" does not equal "Cancelled"? Also, is it best/faster/easier to do dump the data into Excel and do this or should it be done while still in the dataframe?
I have read How to iterate over rows in a DataFrame in Pandas but to be honest, a lot of that went over my head.
I could likely do this with VBA and construct a For Each
loop, but as a challenge and learning experience I want to try and do this all within Python.
EDIT 1:
If column "G" (within excel) or header "Progress Point" in the dataframe has a value of "Cancelled", I want to delete/remove the entire row. There are also 6 columns from the HTML table that I do not need and would also need to delete/remove from the data frame.
Once the data frame is "cleaned up" doing the above, I would then write it to my Excel report.
Here is the "raw" data and here would be the desired output. Several columns were removed along with rows that had the progress point of cancelled.
CodePudding user response:
So basically you are trying to do two things:
You want to delete rows in the table, based on the "Progress Point" column:
df = df[df['Progress Point] != 'Cancelled']
(Get only columns where the value of 'Progress Point' is not 'Cancelled')
You want to drop columns you don't need:
columns_to_drop = ['Line Nbr', 'Total Qty'] # add columns here df = df.drop(columns=columns_to_drop)
After this you can write the data to Excel with:
df.to_excel(path_to_file, sheet_name=sheet_name)
Have a look at the documentation:
Do yourself a favor and get familiar with pandas. You don't need xlwings or openpyxl. You can get the OrderNumber from Excel with pandas.read_excel().