Convert variable rows into columns

Time:08-26

Here is my data -

gs://idfy-documents-staging/inhouse-ocr/0824ce2b-9494-4d91-9455-29cf2a3e2a67_0.jpg:
    Creation time:          Thu, 25 Aug 2022 07:59:35 GMT
    Update time:            Thu, 25 Aug 2022 07:59:35 GMT
    Storage class:          STANDARD
    Retention Expiration:   Mon, 27 Feb 2023 07:59:35 GMT
    Content-Length:         44187
    Content-Type:           image/jpeg
    Metadata:               
        url_id:             0
        doc_type:           ind_aadhaar
        textdetex_fallback: True
        request_id:         0824ce2b-9494-4d91-9455-29cf2a3e2a67
        roi_width:          548
        fallback_reason:    Not Readable
        is_readable:        False
    Hash (crc32c):          60IK6A==
    Hash (md5):             /iNkOna/DUglL7Ny7m7pTA==
    ETag:                   CNKcre3C4fkCEAE=
    Generation:             1661414375378514
    Metageneration:         1
    ACL:                    []
gs://idfy-documents-staging/inhouse-ocr/34b873e6-3b79-45ad-934e-daaa5b61466a_0.jpg:
    Creation time:          Thu, 25 Aug 2022 07:12:29 GMT
    Update time:            Thu, 25 Aug 2022 07:12:29 GMT
    Storage class:          STANDARD
    Retention Expiration:   Mon, 27 Feb 2023 07:12:29 GMT
    Content-Length:         183774
    Content-Type:           image/jpeg
    Metadata:               
        url_id:             0
        readable_score:     71.0
        request_id:         34b873e6-3b79-45ad-934e-daaa5b61466a
        doc_type:           ind_aadhaar
        roi_width:          320
        roi_height:         230
        textdetex_fallback: False
        fallback_reason:    NA
        is_readable:        True
    Hash (crc32c):          kAQ0ow==
    Hash (md5):             bbiIx5VXX3BbwrzbARlxYA==
    ETag:                   COa316m44fkCEAE=
    Generation:             1661411549109222
    Metageneration:         1
    ACL:                    []
gs://idfy-documents-staging/inhouse-ocr/399eed69-61bc-4bc0-900b-22828e19fc45_0.jpg:
    Creation time:          Thu, 25 Aug 2022 07:12:31 GMT
    Update time:            Thu, 25 Aug 2022 07:12:31 GMT
    Storage class:          STANDARD
    Retention Expiration:   Mon, 27 Feb 2023 07:12:31 GMT
    Content-Length:         183774
    Content-Type:           image/jpeg
    Metadata:               
        is_readable:        True
        roi_width:          320
        fallback_reason:    NA
        doc_type:           ind_aadhaar
        textdetex_fallback: False
        url_id:             0
        readable_score:     71.0
        request_id:         399eed69-61bc-4bc0-900b-22828e19fc45
    Hash (crc32c):          kAQ0ow==
    Hash (md5):             bbiIx5VXX3BbwrzbARlxYA==
    ETag:                   CI7egqu44fkCEAE=
    Generation:             1661411551915790
    Metageneration:         1
    ACL:                    []

Every time a line starts with gs://, I want to start a new row; every other line should become a column in that row. Below is the expected output for the first row:

URL Creation_time   Update_time ... Metadata_url_id Metadata_doc_type   ... ACL
gs://idfy-documents-staging/inhouse-ocr/0824ce2b-9494-4d91-9455-29cf2a3e2a67_0.jpg  Thu, 25 Aug 2022 07:59:35 GMT   Thu, 25 Aug 2022 07:59:35 GMT   ... 0   ind_aadhaar ... []
.
.
.

There is an answer here, but it won't work for me because the number of lines to reshape varies from record to record.

Answer:

Try (s is your string from the question):

import pandas as pd

s = """ ... your string from the question ..."""

groups = []
for line in map(str.strip, s.splitlines()):
    if line == "":
        continue

    if line.startswith("gs://"):
        # every gs:// line starts a new row; drop the trailing colon
        groups.append({"URL": line.rstrip(":")})
    else:
        # split on the first colon only, because values such as timestamps contain colons
        key, val = map(str.strip, line.split(":", maxsplit=1))
        groups[-1][key] = val

df = pd.DataFrame(groups)
print(df.to_markdown(index=False))  # to_markdown requires the tabulate package

Prints:

| URL | Creation time | Update time | Storage class | Retention Expiration | Content-Length | Content-Type | Metadata | url_id | doc_type | textdetex_fallback | request_id | roi_width | fallback_reason | is_readable | Hash (crc32c) | Hash (md5) | ETag | Generation | Metageneration | ACL | readable_score | roi_height |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gs://idfy-documents-staging/inhouse-ocr/0824ce2b-9494-4d91-9455-29cf2a3e2a67_0.jpg | Thu, 25 Aug 2022 07:59:35 GMT | Thu, 25 Aug 2022 07:59:35 GMT | STANDARD | Mon, 27 Feb 2023 07:59:35 GMT | 44187 | image/jpeg | | 0 | ind_aadhaar | True | 0824ce2b-9494-4d91-9455-29cf2a3e2a67 | 548 | Not Readable | False | 60IK6A== | /iNkOna/DUglL7Ny7m7pTA== | CNKcre3C4fkCEAE= | 1661414375378514 | 1 | [] | nan | nan |
| gs://idfy-documents-staging/inhouse-ocr/34b873e6-3b79-45ad-934e-daaa5b61466a_0.jpg | Thu, 25 Aug 2022 07:12:29 GMT | Thu, 25 Aug 2022 07:12:29 GMT | STANDARD | Mon, 27 Feb 2023 07:12:29 GMT | 183774 | image/jpeg | | 0 | ind_aadhaar | False | 34b873e6-3b79-45ad-934e-daaa5b61466a | 320 | NA | True | kAQ0ow== | bbiIx5VXX3BbwrzbARlxYA== | COa316m44fkCEAE= | 1661411549109222 | 1 | [] | 71 | 230 |
| gs://idfy-documents-staging/inhouse-ocr/399eed69-61bc-4bc0-900b-22828e19fc45_0.jpg | Thu, 25 Aug 2022 07:12:31 GMT | Thu, 25 Aug 2022 07:12:31 GMT | STANDARD | Mon, 27 Feb 2023 07:12:31 GMT | 183774 | image/jpeg | | 0 | ind_aadhaar | False | 399eed69-61bc-4bc0-900b-22828e19fc45 | 320 | NA | True | kAQ0ow== | bbiIx5VXX3BbwrzbARlxYA== | CI7egqu44fkCEAE= | 1661411551915790 | 1 | [] | 71 | nan |
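
The expected output in the question also prefixes the nested keys with Metadata_ and uses underscores instead of spaces in the column names. Here is a minimal sketch of one way to get that. It assumes the nested keys are always indented deeper (with spaces) than the valueless header line above them, as in the sample data. The function name parse_listing and the shortened sample with gs://bucket/... URLs are made up for illustration and are not part of the original answer.

```python
# Shortened, hypothetical sample in the same layout as the data in the question.
SAMPLE = """\
gs://bucket/a.jpg:
    Creation time:          Thu, 25 Aug 2022 07:59:35 GMT
    Metadata:
        url_id:             0
        doc_type:           ind_aadhaar
    Hash (crc32c):          60IK6A==
    ACL:                    []
gs://bucket/b.jpg:
    Creation time:          Thu, 25 Aug 2022 07:12:29 GMT
    Metadata:
        url_id:             0
        readable_score:     71.0
    ACL:                    []
"""


def parse_listing(text):
    """Return one dict per gs:// line; nested keys get a '<header>_' prefix."""
    rows = []
    parent, parent_indent = None, 0
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if raw.startswith("gs://"):
            # A gs:// line starts a new row and ends any open nested section.
            rows.append({"URL": raw.strip().rstrip(":")})
            parent = None
            continue
        if not rows:
            continue  # ignore anything that comes before the first gs:// line
        indent = len(raw) - len(raw.lstrip(" "))
        # partition splits on the first colon only, so timestamps stay intact.
        key, _, val = raw.partition(":")
        key, val = key.strip().replace(" ", "_"), val.strip()
        if parent is not None and indent > parent_indent:
            rows[-1][f"{parent}_{key}"] = val      # e.g. Metadata_url_id
        elif val == "":
            parent, parent_indent = key, indent    # header such as "Metadata:"
        else:
            parent = None                          # back to a top-level key
            rows[-1][key] = val
    return rows


rows = parse_listing(SAMPLE)
print(sorted(rows[0]))
print(rows[0]["Metadata_doc_type"])  # ind_aadhaar
```

Passing the result to pd.DataFrame(parse_listing(s)) gives one row per gs:// line, with NaN wherever a record lacks a key. The empty Metadata column also disappears, because the header line is only used as a prefix.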