Home > Enterprise >  Is there a rapid way in Python to Group by a column and lines to split?
Is there a rapid way in Python to Group by a column and lines to split?

Time:11-16

I have a quick question. I want to import a .cvs file, and then i want to export it back. But in some cells there are more informations separated by \n. I manage to separete it but, i want to know is there a more faster way to do it? And after i finish with the data, i want to export it back, with the \n. I know is a bit confusing, i will put the example:

From This .csv read

Column1 Column2 Column3
asd\nfgd 11\n22 Sys1
mada\nchaca\nhera 32\n120\n35 Sys2
yay 344 Sys3

to this

Column1 Column2 Column3
asd 11 Sys1
fgd 22 Sys1
mada 32 Sys2
chaca 120 Sys2
hera 35 Sys2
yay 344 Sys3

and vice-versa.

This is my code ( i did it only one way):

import pandas as pd

tabel=pd.read_csv('Test1.csv')

print(tabel)

Column1=[]
Column2=[]
Column3=[]
for i in range(0,len(tabel.index)):
    sys_n=[tabel.iloc[i, 2]]
    if "\n" in tabel.iloc[i,0] and "\n" in tabel.iloc[i,1]:
        Column1.append(tabel.iloc[i, 0].splitlines())
        Column2.append(tabel.iloc[i, 1].splitlines())
        nr_sys=len(tabel.iloc[i, 0].splitlines())
        Column3.extend([sys_n]*nr_sys)
    else:
        Column1.append(tabel.iloc[i, 0].splitlines())
        Column2.append(tabel.iloc[i, 1].splitlines())
        Column3.extend([sys_n])

print(Column1)
result=sum(Column1,[])
result1=sum(Column2,[])
result3=sum(Column3,[])
tabel2=pd.DataFrame(list(zip(result,result1,result3)), columns=[['Column1','Column2','Column3']])
print(tabel2)

tabel2.to_csv('Test1_1.csv',index=False)

Note: If Column1 has 2 lines, Column2 will have it too.

CodePudding user response:

Given tabel

             Column1      Column2 Column3
0           asd\nfgd       11\n22    Sys1
1  mada\nchaca\nhera  32\n120\n35    Sys2
2                yay          344    Sys3

you could do

tabel2 = pd.DataFrame(
    {"Column1": c_1, "Column2": c_2, "Column3": c_3}
    for col_1, col_2, c_3 in tabel.itertuples(index=False)
    for c_1, c_2 in zip(col_1.split("\n"), col_2.split("\n"))
)

or

tabel2 = pd.DataFrame(
    (
        [c_1, c_2, c_3]
        for col_1, col_2, c_3 in tabel.itertuples(index=False)
        for c_1, c_2 in zip(col_1.split("\n"), col_2.split("\n"))
    ),
    columns=tabel.columns
)

Result:

  Column1 Column2 Column3
0     asd      11    Sys1
1     fgd      22    Sys1
2    mada      32    Sys2
3   chaca     120    Sys2
4    hera      35    Sys2
5     yay     344    Sys3

Given tabel2 you could do

col_1, col_2, col_3 = tabel2.columns
tabel = tabel2.groupby(col_3, as_index=False).agg(
    {col_1: "\n".join, col_2: "\n".join, col_3: "first"}
)

Result:

             Column1      Column2 Column3
0           asd\nfgd       11\n22    Sys1
1  mada\nchaca\nhera  32\n120\n35    Sys2
2                yay          344    Sys3

CodePudding user response:

Assuming every cell of each row has the same number of '\n' characters, you can split each cell on '\n' into a list using Series.str.split. Then, explode each list into a different row using Series.explode. You can apply that logic to every column using Series.apply.

Expanding

>>> df

             Column1      Column2 Column3
0           asd\nfgd       11\n22    Sys1
1  mada\nchaca\nhera  32\n120\n35    Sys2
2                yay          344    Sys3

>>> long = df.apply(lambda col: col.str.split(r'\\n').explode())
>>> long

  Column1 Column2 Column3
0     asd      11    Sys1
0     fgd      22    Sys1
1    mada      32    Sys2
1   chaca     120    Sys2
1    hera      35    Sys2
2     yay     344    Sys3

Compressing

To invert the process you can group the rows by their index (level=0) and concatenate string groups using str.join

>>> short = long.groupby(level=0).agg(r'\n'.join)
>>> short 

             Column1      Column2           Column3
0           asd\nfgd       11\n22        Sys1\nSys1
1  mada\nchaca\nhera  32\n120\n35  Sys2\nSys2\nSys2
2                yay          344              Sys3
  • Related