I have the following dataframe:
I want to output the data into multiple sheets, each containing a group of partially filled field columns.
The first 3 are the key columns. I want to compare the columns F5, F6, F7, F8, keep together the columns that have data at the same row indices, and drop the rest. There can be N field columns.
In this case, my output should be in the following format:
Partial Sheet 1: will contain the key columns and the field columns that are filled at the same rows (here F6 and F8), without null values.
Partial Sheet 2: will contain the key columns and one field column (F5) without null values.
Partial Sheet 3: will contain the key columns and one field column (F7) without null values.
I tried researching a lot, but could not find anything substantial.
Any ideas would be appreciated.
CodePudding user response:
Use:
import numpy as np
import pandas as pd

# if missing values are empty strings rather than NaNs, convert them first
df = df.replace('', np.nan)
# field columns: everything after the three key columns
cols = df.columns[3:]
# key columns
keycols = df.columns[:3].tolist()

out = cols.copy()
df1 = df.copy()
# in a loop, collect the groups of columns that are filled at the same row indices
final_cols = []
for c in cols:
    if c in out:
        m = df1[c].notna()
        # columns that are non-missing wherever the tested column is non-missing,
        # and missing wherever the tested column is missing
        c1 = (out[df.loc[m, out].notna().all()]
                 .intersection(out[df.loc[~m, out].isna().all()]))
        df1 = df1.drop(c1, axis=1)
        out = out.difference(c1)
        final_cols.append(c1.tolist())

# finally, sort the groups of columns by length, longest first
final_cols = sorted(final_cols, key=len, reverse=True)

# create an Excel file with one sheet per group of columns
with pd.ExcelWriter('output.xlsx', engine='xlsxwriter') as writer:
    for c in final_cols:
        df[keycols + c].dropna(subset=c).to_excel(writer, sheet_name='_'.join(c))
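As a quick check, run against the sample frame shown under "Used input" in the answer below, the loop groups the columns into:
print(final_cols)
# [['F6', 'F8'], ['F5'], ['F7']]
# -> one sheet each named F6_F8, F5 and F7 in output.xlsx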
CodePudding user response:
Full code
import pandas as pd

# map each field column to the frozenset of row indices where it has data
cols = (df
        .filter(like='F').replace('', pd.NA)
        .stack().reset_index(0)
        .groupby(level=0)['level_0'].agg(frozenset)
        )
# group together the columns that share the same set of indices
common = cols.reset_index().groupby('level_0')['index'].agg(list).to_list()

extra = ['Key1', 'Key2', 'Key3']
for c in common:
    print(f'data_cols_{"-".join(c)}.csv')
    print(df.loc[cols[c[0]], extra + c])
Details
Step 1
You can first aggregate, as frozensets, the non-NA row indices per column:
cols = (df
        .filter(like='F').replace('', pd.NA)
        .stack().reset_index(0)
        .groupby(level=0)['level_0'].agg(frozenset)
        )
output:
F5    (2, 3, 4, 5)
F6    (0, 1, 2, 3)
F7          (1, 2)
F8    (0, 1, 2, 3)
Name: level_0, dtype: object
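A note on the choice of frozenset: unlike set, frozenset is hashable, which is what lets it serve as a grouping key in the next step. A minimal illustration:
hash(frozenset({0, 1}))  # fine: frozensets are hashable
# hash({0, 1})           # would raise TypeError: unhashable type: 'set'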
Step 2
Then aggregate again to group the columns with the same indices:
common = cols.reset_index().groupby('level_0')['index'].agg(list).to_list()
# [['F5'], ['F7'], ['F6', 'F8']]
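For reference, the intermediate frame that this groupby operates on, cols.reset_index() (derived from the Step 1 output), looks like:
  index       level_0
0    F5  (2, 3, 4, 5)
1    F6  (0, 1, 2, 3)
2    F7        (1, 2)
3    F8  (0, 1, 2, 3)
Grouping on 'level_0' collects F6 and F8 into one list because they share the same index set.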
Step 3
Finally, slice and export:
extra = ['Key1', 'Key2', 'Key3']
for c in common:
    print(f'data_cols_{"-".join(c)}.csv')
    df.loc[cols[c[0]], extra + c].to_csv(f'data_cols_{"-".join(c)}.csv')
output:
data_cols_F5.csv
  Key1 Key2 Key3 F5
2   x3   y3   z3  a
3   x4   y4   z4  a
4   x5   y5   z5  a
5   x6   y6   z6  a
data_cols_F7.csv
  Key1 Key2 Key3 F7
1   x2   y2   z2  c
2   x3   y3   z3  c
data_cols_F6-F8.csv
  Key1 Key2 Key3 F6 F8
0   x1   y1   z1  b  d
1   x2   y2   z2  b  d
2   x3   y3   z3  b  d
3   x4   y4   z4  b  d
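Since the question asks for one Excel workbook with multiple sheets rather than CSV files, the same loop can target a pd.ExcelWriter instead. A minimal sketch (the output file name and the sheet-naming scheme here are assumptions):
with pd.ExcelWriter('output.xlsx') as writer:
    for c in common:
        # one sheet per group, named after the columns it contains
        df.loc[cols[c[0]], extra + c].to_excel(writer, sheet_name='_'.join(c))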
Used input:
df = pd.DataFrame({'Key1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
                   'Key2': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6'],
                   'Key3': ['z1', 'z2', 'z3', 'z4', 'z5', 'z6'],
                   'F5': ['', '', 'a', 'a', 'a', 'a'],
                   'F6': ['b', 'b', 'b', 'b', '', ''],
                   'F7': ['', 'c', 'c', '', '', ''],
                   'F8': ['d', 'd', 'd', 'd', '', '']})
Comparison of the different answers:
On 60k rows:
# mozway
69.3 ms ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# jezrael
122 ms ± 9.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
On 600k rows:
# mozway
747 ms ± 69.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# jezrael
1.32 s ± 61.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)