Simple dataframe as below, and I want to print the rows of the same index in 1 sorted line.
Wanted:
123@1st line; 2nd line; 3rd line; 4th line
456@a1st line; a2nd line; a3rd line
I've tried:
import pandas as pd
from io import StringIO
csvfile = StringIO(
"""Group Sort String
123 207 1st line
123 208 2nd line
123 209 3rd line
123 210 4th line
456 625 a1st line
456 626 a2nd line
456 627 a3rd line""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')
df['Group'] = df['Group'].astype(str)
df['Sort'] = df['Sort'].astype(int)
df = df.sort_values(['Sort'], ascending=[True])
df_1 = df.groupby(['Group'])['String'].apply(lambda x: '; '.join(x.astype(str))).reset_index()
df_1['String'] = df_1['String'].str.split('; ').apply(set).str.join('; ')
for index, row in df_1.iterrows():
print(row['Group'] '@' row['String'])
But the output is:
123@3rd line; 4th line; 1st line; 2nd line
456@a3rd line; a2nd line; a1st line
Where went wrong and how can I correct it? Thank you.
CodePudding user response:
The problem is that your data are ill-formatted.
Either you keep \t as a separator (what you did in the header), and add \t between each of your columns, or you change the separator as " " (what you did in the body of your dataframe).
Using both will lead to non-necessary formatting problems :
Space separator
csvfile = StringIO(
"""Group Sort String
123 207 1st_line
123 208 2nd_line
123 209 3rd_line
123 210 4th_line
456 625 a1st_line
456 626 a2nd_line
456 627 a3rd_line""")
df = pd.read_csv(csvfile, header = 0, sep = ' ', engine='python')
Tab separator
csvfile = StringIO(
"""Group\tSort\tString
123\t207\t1st line
123\t208\t2nd line
123\t209\t3rd line
123\t210\t4th line
456\t625\ta1st line
456\t626\ta2nd line
456\t627\ta3rd line""")
df = pd.read_csv(csvfile, header = 0, sep = "\t", engine='python')
After that operation, you can check with a debugger for instance, your newly created dataframe as three columns, which wasn't the case with your previous code.
Data Grouping
Then to print all "String" value sharing the same "Group" value, use groupby.
df.groupby("Group").groups
Out[21]: {123: [0, 1, 2, 3], 456: [4, 5, 6]}
Or for better visualization :
for name, group in df.groupby("Group"):
print(f"{name} : {group.loc[:, 'String'].to_numpy()}\n")
123 : ['1st line' '2nd line' '3rd line' '4th line']
456 : ['a1st line' 'a2nd line' 'a3rd line']
Following your guidelines :
for name, group in df.groupby("Group"):
print(f"{name}@{'; '.join(group.loc[:, 'String'].to_numpy())}\n")
123@1st line; 2nd line; 3rd line; 4th line
456@a1st line; a2nd line; a3rd line
CodePudding user response:
the problem is with set
in
df_1['String'] = df_1['String'].str.split('; ').apply(set).str.join('; ')
because sets are unordered. So you need to replace it with something like
df = df.drop_duplicates(subset=['String'])
at the beginning