Home > Enterprise >  How to loop across columns to calculate intersection of values in wide format data?
How to loop across columns to calculate intersection of values in wide format data?

Time:12-09

I have a wide-format dataset that looks like this

Column A Column B Column C Column D
A B
F C D
E
2 2 1 1

Now, what I would like to do is to calculate the count of the values for all columns where both column C and that specific column are not null. So for instance for A and C, the value would be 0, the value for B and C would be one and so on, basically the intersection and to get this value as a row at the bottom for further manipulation. I would be grateful for any help on this.

So far what I have used is

clmn=list(df) 

for i in clmn: 
  df.loc('Intersection')=df[['Column C',i]].notna().all(1)

It does not turn up an error, but the 'Intersection' row has only NA values. So I am assuming it is a syntax error. I would want a loop through all the columns since there are 200 columns. Perhaps I could use indexing instead but I am not aware of the syntax for referencing the columns in such a scenario.

CodePudding user response:

You can use & with conversion to a numpy array for broadcasting, then sum:

counts = (df.notna() & df['Column C'].notna().to_numpy()[:,None]).sum()

Output:

Column A    0
Column B    1
Column C    1
Column D    1
dtype: int64

As a new row:

out = pd.concat([df, counts.to_frame(name='count').T])

Output:

      Column A Column B Column C Column D
0            A        B      NaN      NaN
1          NaN        F        C        D
2            E      NaN      NaN      NaN
count        0        1        1        1

Used input:

nan = float('nan')
df = pd.DataFrame({'Column A': ['A', nan, 'E'],
                   'Column B': ['B', 'F', nan],
                   'Column C': [nan, 'C', nan],
                   'Column D': [nan, 'D', nan]})
  • Related