I have a wide-format dataset that looks like this
Column A | Column B | Column C | Column D |
---|---|---|---|
A | B | ||
F | C | D | |
E | |||
2 | 2 | 1 | 1 |
Now, what I would like to do is to calculate the count of the values for all columns where both column C and that specific column are not null. So for instance for A and C, the value would be 0, the value for B and C would be one and so on, basically the intersection and to get this value as a row at the bottom for further manipulation. I would be grateful for any help on this.
So far what I have used is
clmn=list(df)
for i in clmn:
df.loc('Intersection')=df[['Column C',i]].notna().all(1)
It does not turn up an error, but the 'Intersection' row has only NA values. So I am assuming it is a syntax error. I would want a loop through all the columns since there are 200 columns. Perhaps I could use indexing instead but I am not aware of the syntax for referencing the columns in such a scenario.
CodePudding user response:
You can use &
with conversion to a numpy array for broadcasting, then sum
:
counts = (df.notna() & df['Column C'].notna().to_numpy()[:,None]).sum()
Output:
Column A 0
Column B 1
Column C 1
Column D 1
dtype: int64
As a new row:
out = pd.concat([df, counts.to_frame(name='count').T])
Output:
Column A Column B Column C Column D
0 A B NaN NaN
1 NaN F C D
2 E NaN NaN NaN
count 0 1 1 1
Used input:
nan = float('nan')
df = pd.DataFrame({'Column A': ['A', nan, 'E'],
'Column B': ['B', 'F', nan],
'Column C': [nan, 'C', nan],
'Column D': [nan, 'D', nan]})