How do I transform a text data with cells in a rows separated by pipe sign into specific pattern of-CodePudding

I want to transform below data in to a pattern of specific rows of 4 cells. Please find the specimen of below data.

text = """A | B | Lorem | Ipsum | is | simply | dummy
C | D | text | of | the | printing | and
E | F | typesetting | industry. | Lorem
G | H | more | recently | with | desktop | publishing | software | like | Aldus
I | J | Ipsum | has | been | the | industry's
K | L | standard | dummy | text | ever | since | the | 1500s
M | N | took | a
O | P | scrambled | it | to | make | a | type | specimen | book"""

I am required to transform each row to only contain not more than 4 cells. Any cells coming after fourth cell should be inserted to a next row having that has first two cells similar to that of first row and current row shouldn't also be greater than 4 cells. The transformation of above text data should look like below one.

A | B | Lorem | Ipsum
A | B | is | simply
A | B | dummy
C | D | text | of
C | D | the | printing
C | D | and
E | F | typesetting | industry.
E | F | Lorem
G | H | more | recently
G | H | with | desktop
G | H | publishing | software
G | H | like | Aldus
.
.
and so on...

I have tried something on my own but I am not even half way there as per below code which is incomplete.

new_text = ""

for i in text.split('\n'):
    row = i.split(' | ')
    if len(row) == 4:
        new_text = new_text   i   '\n'
    elif len(row) > 4:
        for j in range(len(row)):
            if j < 3:
                new_text = new_text   row[0]   ' | '   row[1]   ...

I am unable to figure out the logic to use first two cells if number of cells are higher than 4 in each row.

CodePudding user response：

You could just split the input rows, and then process each row 2 elements at a time. Possible code:

for line in io.StringIO(text):
    row = line.strip().split(' | ')
    for i in range(2, len(row), 2):
        print(' | '.join(row[:2]   row[i: i 2]))

it gives as expected:

A | B | Lorem | Ipsum
A | B | is | simply
A | B | dummy
C | D | text | of
C | D | the | printing
C | D | and
E | F | typesetting | industry.
E | F | Lorem
G | H | more | recently
G | H | with | desktop
G | H | publishing | software
G | H | like | Aldus
I | J | Ipsum | has
I | J | been | the
I | J | industry's
K | L | standard | dummy
K | L | text | ever
K | L | since | the
K | L | 1500s
M | N | took | a
O | P | scrambled | it
O | P | to | make
O | P | a | type
O | P | specimen | book

CodePudding user response：

I would use pandas for this task:

import pandas as pd
from io import StringIO

MAX     = 100   # expected maximum number of input columns
IDX_COL = 2     # number of index columns (A / B)
N_COLS  = 2     # number of desired non-index output columns

df = (pd
   .read_csv(io.StringIO(text), sep=r'\s*\|\s*',
             engine='python', names=range(MAX))
   .set_index(list(range(IDX_COL)))
   .pipe(lambda d: d.set_axis(
       pd.MultiIndex.from_arrays(
           [(d.columns-IDX_COL)%N_COLS,
            (d.columns-IDX_COL)//N_COLS]), axis=1)
        )
   .stack().droplevel(IDX_COL)
   .to_csv('output.csv', header=None, sep='|')
)

output file:

A|B|Lorem|Ipsum
A|B|is|simply
A|B|dummy|
C|D|text|of
C|D|the|printing
C|D|and|
E|F|typesetting|industry.
E|F|Lorem|
G|H|more|recently
G|H|with|desktop
G|H|publishing|software
G|H|like|Aldus
I|J|Ipsum|has
I|J|been|the
I|J|industry's|
K|L|standard|dummy
K|L|text|ever
K|L|since|the
K|L|1500s|
M|N|took|a
O|P|scrambled|it
O|P|to|make
O|P|a|type
O|P|specimen|book