Is there a rapid way in Python to Group by a column and lines to split?-CodePudding

I have a quick question. I want to import a .cvs file, and then i want to export it back. But in some cells there are more informations separated by \n. I manage to separete it but, i want to know is there a more faster way to do it? And after i finish with the data, i want to export it back, with the \n. I know is a bit confusing, i will put the example:

From This .csv read

Column1	Column2	Column3
asd\nfgd	11\n22	Sys1
mada\nchaca\nhera	32\n120\n35	Sys2
yay	344	Sys3

to this

Column1	Column2	Column3
asd	11	Sys1
fgd	22	Sys1
mada	32	Sys2
chaca	120	Sys2
hera	35	Sys2
yay	344	Sys3

and vice-versa.

This is my code ( i did it only one way):

import pandas as pd

tabel=pd.read_csv('Test1.csv')

print(tabel)

Column1=[]
Column2=[]
Column3=[]
for i in range(0,len(tabel.index)):
    sys_n=[tabel.iloc[i, 2]]
    if "\n" in tabel.iloc[i,0] and "\n" in tabel.iloc[i,1]:
        Column1.append(tabel.iloc[i, 0].splitlines())
        Column2.append(tabel.iloc[i, 1].splitlines())
        nr_sys=len(tabel.iloc[i, 0].splitlines())
        Column3.extend([sys_n]*nr_sys)
    else:
        Column1.append(tabel.iloc[i, 0].splitlines())
        Column2.append(tabel.iloc[i, 1].splitlines())
        Column3.extend([sys_n])

print(Column1)
result=sum(Column1,[])
result1=sum(Column2,[])
result3=sum(Column3,[])
tabel2=pd.DataFrame(list(zip(result,result1,result3)), columns=[['Column1','Column2','Column3']])
print(tabel2)

tabel2.to_csv('Test1_1.csv',index=False)

Note: If Column1 has 2 lines, Column2 will have it too.

CodePudding user response：

Given tabel

             Column1      Column2 Column3
0           asd\nfgd       11\n22    Sys1
1  mada\nchaca\nhera  32\n120\n35    Sys2
2                yay          344    Sys3

you could do

tabel2 = pd.DataFrame(
    {"Column1": c_1, "Column2": c_2, "Column3": c_3}
    for col_1, col_2, c_3 in tabel.itertuples(index=False)
    for c_1, c_2 in zip(col_1.split("\n"), col_2.split("\n"))
)

tabel2 = pd.DataFrame(
    (
        [c_1, c_2, c_3]
        for col_1, col_2, c_3 in tabel.itertuples(index=False)
        for c_1, c_2 in zip(col_1.split("\n"), col_2.split("\n"))
    ),
    columns=tabel.columns
)

Result:

  Column1 Column2 Column3
0     asd      11    Sys1
1     fgd      22    Sys1
2    mada      32    Sys2
3   chaca     120    Sys2
4    hera      35    Sys2
5     yay     344    Sys3

Given tabel2 you could do

col_1, col_2, col_3 = tabel2.columns
tabel = tabel2.groupby(col_3, as_index=False).agg(
    {col_1: "\n".join, col_2: "\n".join, col_3: "first"}
)

Result:

             Column1      Column2 Column3
0           asd\nfgd       11\n22    Sys1
1  mada\nchaca\nhera  32\n120\n35    Sys2
2                yay          344    Sys3

CodePudding user response：

Assuming every cell of each row has the same number of '\n' characters, you can split each cell on '\n' into a list using Series.str.split. Then, explode each list into a different row using Series.explode. You can apply that logic to every column using Series.apply.

Expanding

>>> df

             Column1      Column2 Column3
0           asd\nfgd       11\n22    Sys1
1  mada\nchaca\nhera  32\n120\n35    Sys2
2                yay          344    Sys3

>>> long = df.apply(lambda col: col.str.split(r'\\n').explode())
>>> long

  Column1 Column2 Column3
0     asd      11    Sys1
0     fgd      22    Sys1
1    mada      32    Sys2
1   chaca     120    Sys2
1    hera      35    Sys2
2     yay     344    Sys3

Compressing

To invert the process you can group the rows by their index (level=0) and concatenate string groups using str.join

>>> short = long.groupby(level=0).agg(r'\n'.join)
>>> short 

             Column1      Column2           Column3
0           asd\nfgd       11\n22        Sys1\nSys1
1  mada\nchaca\nhera  32\n120\n35  Sys2\nSys2\nSys2
2                yay          344              Sys3