I want to read compressed .json.gz files and write their decoded contents into .json files.
.json.gz files:
- data/sample1.gz
- data/sample2.gz
Write to .json files:
- data/sample1.json
- data/sample2.json
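For context, a minimal standalone version of what I am after looks roughly like this (just a sketch using the standard library gzip and shutil modules, with the paths listed above):

import gzip
import shutil

pairs = [("data/sample1.gz", "data/sample1.json"),
         ("data/sample2.gz", "data/sample2.json")]
for src, dst in pairs:
    # Stream the decompressed bytes straight into the target .json file.
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)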
CodePudding user response:
I had a similar requirement: a list of compressed .gz JSON files that I needed to decompress back into .json files with the same base name. The code below works for that.
Place this script in the folder containing the .gz files and run it with python3.
file: script.py
import gzip
import os


def get_file_names_by_extension(path=".", file_extension=".gz"):
    # Collect the names of all files in `path` that end with the given extension.
    file_names = []
    for x in os.listdir(path):
        if x.endswith(file_extension):
            file_names.append(x)
    return file_names


def write_file(data, destination_path, file_name, encoding="utf-8"):
    # Write the decoded text into the destination folder under the given name.
    output_file_name = os.path.join(destination_path, file_name)
    print(output_file_name)
    with open(output_file_name, "w", encoding=encoding) as outfile:
        outfile.write(data)


def decompress_files(files, destination_path, output_format=".json", encoding="utf-8"):
    # Decompress each .gz file and write it back out with the same base name.
    for file in files:
        with gzip.open(file, "rb") as gz_file:
            content = gz_file.read().decode(encoding)
        output_file_name = "".join([file.split(".")[0], output_format])
        write_file(content, destination_path, output_file_name, encoding)


files = get_file_names_by_extension(path=".", file_extension=".gz")
decompress_files(files, ".", ".json")
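If the script is kept at the project root instead of inside the data/ folder from the question, the same functions can be pointed at that folder (a hypothetical adaptation, not part of the original script):

files = get_file_names_by_extension(path="data", file_extension=".gz")
files = [os.path.join("data", name) for name in files]  # prepend the folder so gzip.open can find them
decompress_files(files, ".", ".json")                   # writes data/sample1.json and data/sample2.json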
CodePudding user response:
PySpark can infer from the file name that the JSON files are gzipped, so you can read the data and then write it back without compression to get the result you want. The benefit of doing this in Spark is that it can use multiple workers to read and write the data in parallel, especially if the data is in S3.
df = spark.read.json("data/")
df.write.json("data/", mode="append", compression="none")
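If appending the uncompressed output back into the same data/ directory is not wanted, a small variation is to write to a separate output path instead (the data_uncompressed/ name here is just an example):

df = spark.read.json("data/")
df.write.json("data_uncompressed/", compression="none")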