How do I load a dataframe into an Excel template on Amazon Web Service's S3?-CodePudding

Issue

I have a dataframe. The template I want to use only has column headings. The column headings in the dataframe are identical to the template column headings. How do I paste the contents of the dataframe into the template excel sheet?

Reprex

Example dataframe

import pandas as pd

data_input = {'Area':['North', 'North', 'North',  'South', 'South', 'South',  'West', 'West', 'West',   'East', "East","East"], 
              "Sub-Area": ["North2", "North1", "North2",   "South2", "South1", "South2","West3", "West9", "West9", "East1", "East4", "East1"], 
              "Workers": [1,20,30, 2,33,5, 3,6,44, 1, 11, 111], 
              "Job1":["T", "T", "T",  "X","T", "T",  "T", "X", "T", "X","T", "T"],
              "Job2":["F", "X", "T",  "X","T", "F",  "T", "X", "F", "X","T", "T"],
              "Job3":["T", "F", "T",  "X","X", "F",  "F", "T","X",  "X","T", "T"]}
 
# Create DataFrame
df1 = pd.DataFrame(data_input)

Attempt #1

# Save dataframe to the template file on S3
with io.BytesIO() as output:
    with pd.ExcelWriter(output, engine='openpyxl') as writer:
        df1.to_excel(writer, sheet_name='Sheet1',startcol = 0, startrow=2)
    data = output.getvalue()
s3 = boto3.resource('s3')
s3.Bucket('main_folder').put_object(Key='sub_folder/template.xlsx', Body=data)

Problem: The above solution just writes my dataset over the template file.

Attempt #2: Appending the dataframe via mode = "a"

# Save dataframe to the template file on S3
# Save file to S3

with io.BytesIO() as output:
# here I add mode = "a"
    with pd.ExcelWriter(output, engine='openpyxl', mode = "a") as writer: 
        df1.to_excel(writer, sheet_name='Sheet1',startcol = 0, startrow=2)
    data = output.getvalue()
s3 = boto3.resource('s3')
s3.Bucket('main_folder').put_object(Key='sub_folder/template.xlsx', Body=data)

Problem: Error Message

BadZipFile: File is not a zip file

Attempt 3

In response to a comment from jsn, I tried to first append the df to the template and then load that to S3 but it overwrote all the formatting of the template again.

#downloading template
template = pd.read_excel('s3://main_folder/sub_folder/template.xlsx', sheet_name="Sheet1")

#appending the dataframe
template = template.append(df1)

#now loading to S3
with io.BytesIO() as output:
    with pd.ExcelWriter(output, engine='openpyxl') as writer:
        template.to_excel(writer, sheet_name='Sheet1')
    data = output.getvalue()
s3 = boto3.resource('s3')
s3.Bucket('main_folder').put_object(Key='sub_folder/template.xlsx', Body=data)

Any help would be appreciated

CodePudding user response：

The pandas library may not be suited to store xlsx formatting state.
The alternative here could be openpyxl library which lets you load a workbook and does integrate with pandas to append your data.

You could attempt to do something like this:

from io import BytesIO
from tempfile import NamedTemporaryFile

import boto3
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

# Load Template from S3
bucket_object = boto3.resource('s3').Bucket(bucket_name).Object(object_key)
content = bucket_object.get()['Body'].read()

# Input Data
data_input = {
    'Area': ['North', 'North', 'North', 'South', 'South', 'South', 'West', 'West', 'West', 'East', "East", "East"],
    "Sub-Area": ["North2", "North1", "North2", "South2", "South1", "South2", "West3", "West9", "West9", "East1",
                 "East4", "East1"],
    "Workers": [1, 20, 30, 2, 33, 5, 3, 6, 44, 1, 11, 111],
    "Job1": ["T", "T", "T", "X", "T", "T", "T", "X", "T", "X", "T", "T"],
    "Job2": ["F", "X", "T", "X", "T", "F", "T", "X", "F", "X", "T", "T"],
    "Job3": ["T", "F", "T", "X", "X", "F", "F", "T", "X", "X", "T", "T"]}

# Create DataFrame
df = pd.DataFrame(data_input)

# Load Workbook
wb = load_workbook(filename=(BytesIO(content)))
ws = wb['Sheet1']

# Append contents of Input Data to Workbook
for r in dataframe_to_rows(df, index=False, header=False):
    ws.append(r)

# Save Workbook back to S3
with NamedTemporaryFile() as tmp:
    filename = '/tmp/{}'.format(object_key)
    wb.save(filename)
    s3_resource.Bucket(bucket_name).upload_file(Filename=filename, Key=dest_filename)

Reference material as follows:

Note:

This was tested locally (i.e. without AWS) and the output suggested that formatting applied to the heading columns in the template file remained even after the new data was added.