I want to transform below data in to a pattern of specific rows of 4 cells. Please find the specimen of below data.
text = """A | B | Lorem | Ipsum | is | simply | dummy
C | D | text | of | the | printing | and
E | F | typesetting | industry. | Lorem
G | H | more | recently | with | desktop | publishing | software | like | Aldus
I | J | Ipsum | has | been | the | industry's
K | L | standard | dummy | text | ever | since | the | 1500s
M | N | took | a
O | P | scrambled | it | to | make | a | type | specimen | book"""
I am required to transform each row to only contain not more than 4 cells. Any cells coming after fourth cell should be inserted to a next row having that has first two cells similar to that of first row and current row shouldn't also be greater than 4 cells. The transformation of above text data should look like below one.
A | B | Lorem | Ipsum
A | B | is | simply
A | B | dummy
C | D | text | of
C | D | the | printing
C | D | and
E | F | typesetting | industry.
E | F | Lorem
G | H | more | recently
G | H | with | desktop
G | H | publishing | software
G | H | like | Aldus
.
.
and so on...
I have tried something on my own but I am not even half way there as per below code which is incomplete.
new_text = ""
for i in text.split('\n'):
row = i.split(' | ')
if len(row) == 4:
new_text = new_text i '\n'
elif len(row) > 4:
for j in range(len(row)):
if j < 3:
new_text = new_text row[0] ' | ' row[1] ...
I am unable to figure out the logic to use first two cells if number of cells are higher than 4 in each row.
CodePudding user response:
You could just split the input rows, and then process each row 2 elements at a time. Possible code:
for line in io.StringIO(text):
row = line.strip().split(' | ')
for i in range(2, len(row), 2):
print(' | '.join(row[:2] row[i: i 2]))
it gives as expected:
A | B | Lorem | Ipsum
A | B | is | simply
A | B | dummy
C | D | text | of
C | D | the | printing
C | D | and
E | F | typesetting | industry.
E | F | Lorem
G | H | more | recently
G | H | with | desktop
G | H | publishing | software
G | H | like | Aldus
I | J | Ipsum | has
I | J | been | the
I | J | industry's
K | L | standard | dummy
K | L | text | ever
K | L | since | the
K | L | 1500s
M | N | took | a
O | P | scrambled | it
O | P | to | make
O | P | a | type
O | P | specimen | book
CodePudding user response:
I would use pandas for this task:
import pandas as pd
from io import StringIO
MAX = 100 # expected maximum number of input columns
IDX_COL = 2 # number of index columns (A / B)
N_COLS = 2 # number of desired non-index output columns
df = (pd
.read_csv(io.StringIO(text), sep=r'\s*\|\s*',
engine='python', names=range(MAX))
.set_index(list(range(IDX_COL)))
.pipe(lambda d: d.set_axis(
pd.MultiIndex.from_arrays(
[(d.columns-IDX_COL)%N_COLS,
(d.columns-IDX_COL)//N_COLS]), axis=1)
)
.stack().droplevel(IDX_COL)
.to_csv('output.csv', header=None, sep='|')
)
output file:
A|B|Lorem|Ipsum
A|B|is|simply
A|B|dummy|
C|D|text|of
C|D|the|printing
C|D|and|
E|F|typesetting|industry.
E|F|Lorem|
G|H|more|recently
G|H|with|desktop
G|H|publishing|software
G|H|like|Aldus
I|J|Ipsum|has
I|J|been|the
I|J|industry's|
K|L|standard|dummy
K|L|text|ever
K|L|since|the
K|L|1500s|
M|N|took|a
O|P|scrambled|it
O|P|to|make
O|P|a|type
O|P|specimen|book