I am using the Parquet file format to store a structured version of raw data received in CSV format. For this, I use Pandas for DataFrame manipulation and PyArrow to save the DataFrame in Parquet format. I also use a PyArrow schema to perform some level of data validation.
My challenge here is to convert a string column to a categorical data type while preserving the original codes from the raw data. For example, if the original data has a column Gender { "M", "F" }, I want to use "M" as the code for "Male" and "F" as the code for "Female". I don't want the numerical codes assigned by pd.Categorical, i.e. 0, 1. This may not be best practice from a Kimball methodology perspective, which advocates surrogate keys over natural keys, but it's what I want to achieve for now.
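For illustration, this is the default behavior I want to avoid:
import pandas as pd

cats = pd.Categorical(["M", "F", "M"])
print(cats.categories)  # Index(['F', 'M'], dtype='object')
print(cats.codes)       # [1 0 1] -- integer codes, not the original "M"/"F"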
In the example below, I have the codes for EmployeeCategory { "SR", "C", "M" }, and I currently have to create another column, EmployeeCategoryDescription, and store it together with the data. Is it possible to store only the original EmployeeCategory, denote it as categorical, and store its mapping somewhere in the Parquet metadata?
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    'Name': ['John', 'James', 'Jack', 'Jonathan', 'Jerry', 'Joseph'],
    'EmployeeCategory': ["SR", "C", "M", "SR", "M", pd.NA]
})

employeeCategoryMap = {
    "SR": "Senior",
    "C": "Consultant",
    "M": "Manager"
}

df['EmployeeCategoryDescription'] = df['EmployeeCategory'].map(employeeCategoryMap).fillna(pd.NA)

schema = pa.schema([
    pa.field('Name', pa.string()),
    pa.field('EmployeeCategory', pa.string()),
    pa.field('EmployeeCategoryDescription', pa.string())
])

tbl = pa.Table.from_pandas(df, schema)
pq.write_table(tbl, 'test.parquet')
CodePudding user response:
It's not clear to me where the employee category map is coming from. However, if your goal is to store it alongside a column for later use, then you could use field metadata. Here is a complete example; the relevant pieces are:
Storing the map in the field metadata:
import json

table = pa.Table.from_pandas(df)

# Field metadata is a map from byte string to byte string,
# so we need to serialize the map somehow. Assuming it is
# a fairly simple map, JSON should work fine.
mapJson = json.dumps(employeeCategoryMap).encode('utf8')

# Fields and tables are immutable, so we get a new field and then
# use it to create a new table. I used 'categories' as a key, but
# you can use whatever you want, and you might want to namespace it
# with your project / company name for uniqueness.
annotated_field = table.field('EmployeeCategory').with_metadata({b'categories': mapJson})

# Replacing a field requires the index, but you could easily find that
# by name.
annotated_table = table.set_column(1, annotated_field, table.column(1))
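For completeness, here is the round trip through a Parquet file (a minimal sketch; it defines the table_from_file variable that the retrieval snippet below uses):
import pyarrow.parquet as pq

# Write the annotated table; the field metadata is preserved in the
# file's Arrow schema and comes back on read.
pq.write_table(annotated_table, 'test.parquet')

# Read it back and convert to pandas for the retrieval step below.
table_from_file = pq.read_table('test.parquet')
df = table_from_file.to_pandas()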
Retrieving and applying the categories:
# We could grab the field directly, but I'm pretending you have
# multiple fields with descriptions and don't know ahead of time
# which fields they are.
for field in table_from_file.schema:
    field_metadata = field.metadata
    # Make sure to use b'' strings as keys.
    if field_metadata is not None and b'categories' in field_metadata:
        # Deserialize the category information.
        categories = json.loads(field_metadata[b'categories'])
        df[f'{field.name}Description'] = df[field.name].map(categories).fillna(pd.NA)
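If you also want the restored column to be a pandas categorical that keeps "SR" / "C" / "M" as its values (rather than the default integer codes the question mentions), you can convert it after the loop. A minimal sketch, assuming categories still holds the map recovered above:
# The category values are the original string codes; pandas keeps
# integer codes internally, but the data you see and write out stays
# "SR" / "C" / "M".
df['EmployeeCategory'] = df['EmployeeCategory'].astype(
    pd.CategoricalDtype(categories=list(categories.keys()))
)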