The goal is to merge multiple Parquet files into a single Athena table so that I can query them.
I converted two CSV files to Parquet:
import pandas

pandas.read_csv('a.csv').to_parquet('a.parquet', index=False)
pandas.read_csv('b.csv').to_parquet('b.parquet', index=False)
Each CSV has the columns id,name,age, for example:
1,john,20
2,mark,25
I uploaded these to the S3 bucket s3://my-test-bucket, which is arranged as:
my-test-bucket
-> folder1
-> a.parquet
-> folder2
-> b.parquet
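For reference, the upload step with boto3 would look roughly like this (a sketch; the bucket and key names follow the layout above):
import boto3

s3 = boto3.client("s3")
# Place each Parquet file under its own prefix, matching the layout above
s3.upload_file("a.parquet", "my-test-bucket", "folder1/a.parquet")
s3.upload_file("b.parquet", "my-test-bucket", "folder2/b.parquet")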
Then I created the table in Athena via:
CREATE EXTERNAL TABLE `my_table`(
  `id` int,
  `name` string,
  `age` int
)
STORED AS PARQUET
LOCATION 's3://my-test-bucket'
TBLPROPERTIES ('parquet.compression'='SNAPPY');
However, when I try to query the table via:
SELECT * FROM my_table;
I get no rows back. Is there something that I am missing?
CodePudding user response:
A possible workaround is to use AWS Data Wrangler (the awswrangler package):
import awswrangler as wr
import pandas as pd

path = 's3://my-test-bucket'

# Assume the data has been extracted in dict form from the CSVs
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["John", "Jane"],
    "age": [10, 11],
    "folder": [1, 2]
})

# Write a partitioned Parquet dataset instead of standalone files
wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    partition_cols=["folder"]
)
This generates the following layout in S3:
my-test-bucket
-> folder=1
-> <some hash>.parquet
-> folder=2
-> <some hash>.parquet
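Because dataset=True writes Hive-style partitions (folder=1, folder=2), Athena can query the data once the table and its partitions are registered in the Glue catalog. One way to do that is to pass database and table to wr.s3.to_parquet; here is a sketch, assuming a Glue database named my_db already exists (my_db and my_table are placeholder names):
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "name": ["John", "Jane"],
    "age": [10, 11],
    "folder": [1, 2]
})

# Passing database/table registers the dataset, including its
# partition metadata, in the Glue catalog as it is written.
wr.s3.to_parquet(
    df=df,
    path="s3://my-test-bucket",
    dataset=True,
    mode="overwrite",
    partition_cols=["folder"],
    database="my_db",  # assumed to already exist in the Glue catalog
    table="my_table",
)

# Read back through Athena to verify; the partition column `folder`
# comes back as a regular column in the result.
out = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")
print(out)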