Here is my data:
gs://idfy-documents-staging/inhouse-ocr/0824ce2b-9494-4d91-9455-29cf2a3e2a67_0.jpg:
    Creation time: Thu, 25 Aug 2022 07:59:35 GMT
    Update time: Thu, 25 Aug 2022 07:59:35 GMT
    Storage class: STANDARD
    Retention Expiration: Mon, 27 Feb 2023 07:59:35 GMT
    Content-Length: 44187
    Content-Type: image/jpeg
    Metadata:
        url_id: 0
        doc_type: ind_aadhaar
        textdetex_fallback: True
        request_id: 0824ce2b-9494-4d91-9455-29cf2a3e2a67
        roi_width: 548
        fallback_reason: Not Readable
        is_readable: False
    Hash (crc32c): 60IK6A==
    Hash (md5): /iNkOna/DUglL7Ny7m7pTA==
    ETag: CNKcre3C4fkCEAE=
    Generation: 1661414375378514
    Metageneration: 1
    ACL: []
gs://idfy-documents-staging/inhouse-ocr/34b873e6-3b79-45ad-934e-daaa5b61466a_0.jpg:
    Creation time: Thu, 25 Aug 2022 07:12:29 GMT
    Update time: Thu, 25 Aug 2022 07:12:29 GMT
    Storage class: STANDARD
    Retention Expiration: Mon, 27 Feb 2023 07:12:29 GMT
    Content-Length: 183774
    Content-Type: image/jpeg
    Metadata:
        url_id: 0
        readable_score: 71.0
        request_id: 34b873e6-3b79-45ad-934e-daaa5b61466a
        doc_type: ind_aadhaar
        roi_width: 320
        roi_height: 230
        textdetex_fallback: False
        fallback_reason: NA
        is_readable: True
    Hash (crc32c): kAQ0ow==
    Hash (md5): bbiIx5VXX3BbwrzbARlxYA==
    ETag: COa316m44fkCEAE=
    Generation: 1661411549109222
    Metageneration: 1
    ACL: []
gs://idfy-documents-staging/inhouse-ocr/399eed69-61bc-4bc0-900b-22828e19fc45_0.jpg:
    Creation time: Thu, 25 Aug 2022 07:12:31 GMT
    Update time: Thu, 25 Aug 2022 07:12:31 GMT
    Storage class: STANDARD
    Retention Expiration: Mon, 27 Feb 2023 07:12:31 GMT
    Content-Length: 183774
    Content-Type: image/jpeg
    Metadata:
        is_readable: True
        roi_width: 320
        fallback_reason: NA
        doc_type: ind_aadhaar
        textdetex_fallback: False
        url_id: 0
        readable_score: 71.0
        request_id: 399eed69-61bc-4bc0-900b-22828e19fc45
    Hash (crc32c): kAQ0ow==
    Hash (md5): bbiIx5VXX3BbwrzbARlxYA==
    ETag: CI7egqu44fkCEAE=
    Generation: 1661411551915790
    Metageneration: 1
    ACL: []
Every time a line starts with gs://, I want to start a new row; every other line should become a column of that row. Below is the expected output for the first row:
| URL | Creation_time | Update_time | ... | Metadata_url_id | Metadata_doc_type | ... | ACL |
|---|---|---|---|---|---|---|---|
| gs://idfy-documents-staging/inhouse-ocr/0824ce2b-9494-4d91-9455-29cf2a3e2a67_0.jpg | Thu, 25 Aug 2022 07:59:35 GMT | Thu, 25 Aug 2022 07:59:35 GMT | ... | 0 | ind_aadhaar | ... | [] |
.
.
.
There is an answer here, but it won't work for me because the number of lines to reshape varies from record to record.
CodePudding user response:
Try this (s is your string from the question):
import pandas as pd

s = """ ... your string from the question ..."""

groups = []
for line in map(str.strip, s.splitlines()):
    if line == "":
        continue
    if line.startswith("gs://"):
        # A gs:// line starts a new row; store its own URL without the trailing colon.
        groups.append({"URL": line.rstrip(":")})
    else:
        # Every other line is "key: value". Split on the first colon only,
        # because the timestamps contain colons too.
        key, val = map(str.strip, line.split(":", maxsplit=1))
        groups[-1][key] = val

df = pd.DataFrame(groups)
print(df.to_markdown(index=False))  # to_markdown requires the tabulate package
Prints:
| URL | Creation time | Update time | Storage class | Retention Expiration | Content-Length | Content-Type | Metadata | url_id | doc_type | textdetex_fallback | request_id | roi_width | fallback_reason | is_readable | Hash (crc32c) | Hash (md5) | ETag | Generation | Metageneration | ACL | readable_score | roi_height |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gs://idfy-documents-staging/inhouse-ocr/0824ce2b-9494-4d91-9455-29cf2a3e2a67_0.jpg | Thu, 25 Aug 2022 07:59:35 GMT | Thu, 25 Aug 2022 07:59:35 GMT | STANDARD | Mon, 27 Feb 2023 07:59:35 GMT | 44187 | image/jpeg | | 0 | ind_aadhaar | True | 0824ce2b-9494-4d91-9455-29cf2a3e2a67 | 548 | Not Readable | False | 60IK6A== | /iNkOna/DUglL7Ny7m7pTA== | CNKcre3C4fkCEAE= | 1661414375378514 | 1 | [] | nan | nan |
| gs://idfy-documents-staging/inhouse-ocr/34b873e6-3b79-45ad-934e-daaa5b61466a_0.jpg | Thu, 25 Aug 2022 07:12:29 GMT | Thu, 25 Aug 2022 07:12:29 GMT | STANDARD | Mon, 27 Feb 2023 07:12:29 GMT | 183774 | image/jpeg | | 0 | ind_aadhaar | False | 34b873e6-3b79-45ad-934e-daaa5b61466a | 320 | NA | True | kAQ0ow== | bbiIx5VXX3BbwrzbARlxYA== | COa316m44fkCEAE= | 1661411549109222 | 1 | [] | 71 | 230 |
| gs://idfy-documents-staging/inhouse-ocr/399eed69-61bc-4bc0-900b-22828e19fc45_0.jpg | Thu, 25 Aug 2022 07:12:31 GMT | Thu, 25 Aug 2022 07:12:31 GMT | STANDARD | Mon, 27 Feb 2023 07:12:31 GMT | 183774 | image/jpeg | | 0 | ind_aadhaar | False | 399eed69-61bc-4bc0-900b-22828e19fc45 | 320 | NA | True | kAQ0ow== | bbiIx5VXX3BbwrzbARlxYA== | CI7egqu44fkCEAE= | 1661411551915790 | 1 | [] | 71 | nan |
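The expected output in the question uses names such as Creation_time and Metadata_url_id, while the table above leaves the metadata keys unprefixed and keeps an empty Metadata column. One possible way to get closer is sketched below. It assumes the raw listing keeps its indentation, with metadata entries indented deeper than the Metadata: line; if the indentation has been stripped, this approach cannot tell metadata keys from top-level ones. The function name parse_listing and the trimmed SAMPLE string are made up for this illustration.

```python
import pandas as pd

# Trimmed copy of the first two records; the indentation is an assumption
# about how the raw listing looks before any whitespace is stripped.
SAMPLE = """\
gs://idfy-documents-staging/inhouse-ocr/0824ce2b-9494-4d91-9455-29cf2a3e2a67_0.jpg:
    Creation time: Thu, 25 Aug 2022 07:59:35 GMT
    Content-Length: 44187
    Metadata:
        url_id: 0
        doc_type: ind_aadhaar
    Hash (crc32c): 60IK6A==
    ACL: []
gs://idfy-documents-staging/inhouse-ocr/34b873e6-3b79-45ad-934e-daaa5b61466a_0.jpg:
    Creation time: Thu, 25 Aug 2022 07:12:29 GMT
    Content-Length: 183774
    Metadata:
        url_id: 0
        readable_score: 71.0
    Hash (crc32c): kAQ0ow==
    ACL: []
"""


def parse_listing(text: str) -> pd.DataFrame:
    """Build one row per gs:// line; prefix nested metadata keys with Metadata_."""
    rows = []
    meta_indent = None  # indentation of the current "Metadata:" header, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("gs://"):
            rows.append({"URL": line.rstrip(":")})
            meta_indent = None
            continue
        if not rows:
            continue  # ignore anything before the first gs:// line
        indent = len(raw) - len(raw.lstrip())
        key, _, val = line.partition(":")  # split on the first colon only
        key, val = key.strip().replace(" ", "_"), val.strip()
        if key == "Metadata" and val == "":
            meta_indent = indent  # header line only; it gets no column of its own
            continue
        if meta_indent is not None and indent > meta_indent:
            key = f"Metadata_{key}"
        else:
            meta_indent = None  # back at the outer indentation level
        rows[-1][key] = val
    return pd.DataFrame(rows)


df = parse_listing(SAMPLE)
print(df.columns.tolist())
```

All values are still strings at this point, and keys missing from a record (here Metadata_doc_type and Metadata_readable_score) come out as NaN, just as in the answer above.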