Home > front end >  How do I load a dataframe into an Excel template on Amazon Web Service's S3?
How do I load a dataframe into an Excel template on Amazon Web Service's S3?

Time:08-08

Issue

I have a dataframe. The template I want to use only has column headings. The column headings in the dataframe are identical to the template column headings. How do I paste the contents of the dataframe into the template excel sheet?

Reprex

Example dataframe

import pandas as pd

data_input = {'Area':['North', 'North', 'North',  'South', 'South', 'South',  'West', 'West', 'West',   'East', "East","East"], 
              "Sub-Area": ["North2", "North1", "North2",   "South2", "South1", "South2","West3", "West9", "West9", "East1", "East4", "East1"], 
              "Workers": [1,20,30, 2,33,5, 3,6,44, 1, 11, 111], 
              "Job1":["T", "T", "T",  "X","T", "T",  "T", "X", "T", "X","T", "T"],
              "Job2":["F", "X", "T",  "X","T", "F",  "T", "X", "F", "X","T", "T"],
              "Job3":["T", "F", "T",  "X","X", "F",  "F", "T","X",  "X","T", "T"]}
 
# Create DataFrame
df1 = pd.DataFrame(data_input)

Attempt #1

# Save dataframe to the template file on S3
with io.BytesIO() as output:
    with pd.ExcelWriter(output, engine='openpyxl') as writer:
        df1.to_excel(writer, sheet_name='Sheet1',startcol = 0, startrow=2)
    data = output.getvalue()
s3 = boto3.resource('s3')
s3.Bucket('main_folder').put_object(Key='sub_folder/template.xlsx', Body=data)

Problem: The above solution just writes my dataset over the template file.

Attempt #2: Appending the dataframe via mode = "a"

# Save dataframe to the template file on S3
# Save file to S3

with io.BytesIO() as output:
# here I add mode = "a"
    with pd.ExcelWriter(output, engine='openpyxl', mode = "a") as writer: 
        df1.to_excel(writer, sheet_name='Sheet1',startcol = 0, startrow=2)
    data = output.getvalue()
s3 = boto3.resource('s3')
s3.Bucket('main_folder').put_object(Key='sub_folder/template.xlsx', Body=data)

Problem: Error Message

BadZipFile: File is not a zip file

Attempt 3

In response to a comment from jsn, I tried to first append the df to the template and then load that to S3 but it overwrote all the formatting of the template again.

#downloading template
template = pd.read_excel('s3://main_folder/sub_folder/template.xlsx', sheet_name="Sheet1")

#appending the dataframe
template = template.append(df1)

#now loading to S3
with io.BytesIO() as output:
    with pd.ExcelWriter(output, engine='openpyxl') as writer:
        template.to_excel(writer, sheet_name='Sheet1')
    data = output.getvalue()
s3 = boto3.resource('s3')
s3.Bucket('main_folder').put_object(Key='sub_folder/template.xlsx', Body=data)

Any help would be appreciated

CodePudding user response:

  • The pandas library may not be suited to store xlsx formatting state.
  • The alternative here could be openpyxl library which lets you load a workbook and does integrate with pandas to append your data.

You could attempt to do something like this:

from io import BytesIO
from tempfile import NamedTemporaryFile

import boto3
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

# Load Template from S3
bucket_object = boto3.resource('s3').Bucket(bucket_name).Object(object_key)
content = bucket_object.get()['Body'].read()

# Input Data
data_input = {
    'Area': ['North', 'North', 'North', 'South', 'South', 'South', 'West', 'West', 'West', 'East', "East", "East"],
    "Sub-Area": ["North2", "North1", "North2", "South2", "South1", "South2", "West3", "West9", "West9", "East1",
                 "East4", "East1"],
    "Workers": [1, 20, 30, 2, 33, 5, 3, 6, 44, 1, 11, 111],
    "Job1": ["T", "T", "T", "X", "T", "T", "T", "X", "T", "X", "T", "T"],
    "Job2": ["F", "X", "T", "X", "T", "F", "T", "X", "F", "X", "T", "T"],
    "Job3": ["T", "F", "T", "X", "X", "F", "F", "T", "X", "X", "T", "T"]}

# Create DataFrame
df = pd.DataFrame(data_input)

# Load Workbook
wb = load_workbook(filename=(BytesIO(content)))
ws = wb['Sheet1']

# Append contents of Input Data to Workbook
for r in dataframe_to_rows(df, index=False, header=False):
    ws.append(r)

# Save Workbook back to S3
with NamedTemporaryFile() as tmp:
    filename = '/tmp/{}'.format(object_key)
    wb.save(filename)
    s3_resource.Bucket(bucket_name).upload_file(Filename=filename, Key=dest_filename)

Reference material as follows:

Note:

This was tested locally (i.e. without AWS) and the output suggested that formatting applied to the heading columns in the template file remained even after the new data was added.

  • Related