I have a data frame:
id s1 s2 t1 t2
id1 4 7 9 6
id2 6 7 2 3
id3 2 2 7 2
id4 5 9 2 7
id5 6 1 5 1
I want to print, for each id row, the pair of columns with '1' in it, and then the pair of columns with '2' in it (they are before and after values, so then e.g. for id1, i'll compare the before and after list):
So the output should be:
id1 = [4,9]
id2 = [6,2]
id3 = [2,7]
id4 = [5,2]
id5 = [6,5]
and the same for the ts
id1 = [7,6]
id2 = [7,3]
id3 = [2,2]
id4 = [9,7]
id5 = [1,1]
And then I'm going to compare each id list s and t values.
I wrote:
for idx,row in df.iterrows():
id = row['id']
before_row = row.loc[:,row.column.str.contains('1')]
print(before_row)
#then get the after row by doing the same with contains('2')
#then compare the pair of lists
I got the error:
AttributeError: 'Series' object has no attribute 'column'
I understand the error that the column name isn't in the list, but not how to fix it; could someone should be how to do this?
CodePudding user response:
Do not use iterrows
, it is slow. You can use filter
to get columns ending in 1/2, then if really needed you can apply a loop:
# asuming id is the index, else run
#df = df.set_index('id')
df1 = df.filter(regex='1$')
df2 = df.filter(regex='2$')
for idx, row in df1.iterrows():
print(row.to_list())
output:
[4, 9]
[6, 2]
[2, 7]
[5, 2]
[6, 5]
another approach for fully automated splitting:
groups = df.columns.str.extract('(\d $)', expand=False)
dfs = dict(list(df.groupby(groups, axis=1)))
dfs['1']
s1 t1
id
id1 4 9
id2 6 2
id3 2 7
id4 5 2
id5 6 5