Home > database >  Pandas - concatenating selected rows from several groups with additional condition
Pandas - concatenating selected rows from several groups with additional condition

Time:05-25

I have an example

 ───────────── ──────── ──────── ──────── 
| main_group  | COL_A  | COL_B  | COL_C  |
 ───────────── ──────── ──────── ──────── 
| 0           | TXT1   |        | None   |
| 0           | TXT2   |        | None   |
| 0           | 5      |        | None   |
| 0           | 1.93   | 1.93   | 0      |
| 0           | 7.60   | 7.60   | 1      |
| 0           | 2.46   | 2.46   | 1      |
| 1           | TXT11  |        | None   |
| 1           | TXT12  |        | None   |
| 1           | 0.50   |        | None   |
| 1           | 0.45   | 0.45   | 0      |
| 1           | 0.31   | 0.31   | 1      |
| 1           | 0.35   | 0.35   | 1      |
| 1           | 0.73   | 0.73   | 1      |
| 2           | 0.5    |        | None   |
| 2           | 4.15   | 4.15   | 0      |
| 2           | 2.98   | 2.98   | 0      |
| 2           | 1.53   | 1.53   | 0      |
| 3           | 4.46   |        | None   |
| 3           | 4.00   | 4.00   | 0      |
| 3           | 0.95   | 0.95   | 1      |
| 3           | 1.35   | 1.35   | 1      |
| 3           | 1.79   | 1.79   | 1      |
 ───────────── ──────── ──────── ──────── 

I would like to move the value from COL_A for the last occurrence of a row with None in column COL_C per main_group. This value should be moved to the first element in group for COL_B column and then the previous contents should be removed.

Here is how it should look like for main_group == 0:

 ───────────── ──────── ──────── ──────── 
| main_group  | COL_A  | COL_B  | COL_C  |
 ───────────── ──────── ──────── ──────── 
| 0           | TXT1   | 5      | None   | <--- value "5" from the last row with "None" in `COL_C` in `main_group` == 0 was moved to the first row in the same group
| 0           | TXT2   |        | None   |
| 0           | 5      |        | None   | <--- After that this row should be deleted
| 0           | 1.93   | 1.93   | 0      |
| 0           | 7.60   | 7.60   | 1      |
| 0           | 2.46   | 2.46   | 1      |
 ───────────── ──────── ──────── ──────── 
 ───────────── ──────── ──────── ──────── 
| main_group  | COL_A  | COL_B  | COL_C  |
 ───────────── ──────── ──────── ──────── 
| 2           | 0.5    | 0.5    | None   | <---  value in column `COL_B` should be same as in column `COL_A` because there are no other rows in the same `main_group` with "None" in column `COL_C`
| 2           | 4.15   | 4.15   | 0      |
| 2           | 2.98   | 2.98   | 0      |
| 2           | 1.53   | 1.53   | 0      |
| 3           | 4.46   | 4.46   | None   | <---  value in column `COL_B` should be same as in column `COL_A` because there are no other rows in the same `main_group` with "None" in column `COL_C`
| 3           | 4.00   | 4.00   | 0      |
| 3           | 0.95   | 0.95   | 1      |
| 3           | 1.35   | 1.35   | 1      |
| 3           | 1.79   | 1.79   | 1      |
 ───────────── ──────── ──────── ──────── 

After this operation, df should looks like:

 ───────────── ──────── ──────── ──────── 
| main_group  | COL_A  | COL_B  | COL_C  |
 ───────────── ──────── ──────── ──────── 
| 0           | TXT1   | 5      | None   |
| 0           | TXT2   |        | None   |
| 0           | 1.93   | 1.93   | 0      |
| 0           | 7.60   | 7.60   | 1      |
| 0           | 2.46   | 2.46   | 1      |
| 1           | TXT11  | 0.50   | None   |
| 1           | TXT12  |        | None   |
| 1           | 0.45   | 0.45   | 0      |
| 1           | 0.31   | 0.31   | 1      |
| 1           | 0.35   | 0.35   | 1      |
| 1           | 0.73   | 0.73   | 1      |
| 2           | 0.5    | 0.5    | None   |
| 2           | 4.15   | 4.15   | 0      |
| 2           | 2.98   | 2.98   | 0      |
| 2           | 1.53   | 1.53   | 0      |
| 3           | 4.46   | 4.46   | None   |
| 3           | 4.00   | 4.00   | 0      |
| 3           | 0.95   | 0.95   | 1      |
| 3           | 1.35   | 1.35   | 1      |
| 3           | 1.79   | 1.79   | 1      |
 ───────────── ──────── ──────── ──────── 

In the last step, I would like to concatenating selected COL_A column in each main_group where value in COL_C is None.

Example:

 ───────────── ──────── ──────── ──────── 
| main_group  | COL_A  | COL_B  | COL_C  |
 ───────────── ──────── ──────── ──────── 
| 0           | TXT1   | 5      | None   |
| 0           | TXT2   |        | None   |
| 0           | 1.93   | 1.93   | 0      |
| 0           | 7.60   | 7.60   | 1      |
| 0           | 2.46   | 2.46   | 1      |

↓↓↓↓↓↓↓↓↓↓

 ───────────── ──────────── ──────── ──────── 
| main_group  | COL_A      | COL_B  | COL_C  |
 ───────────── ──────────── ──────── ──────── 
| 0           | TXT1 TXT2  | 5      | None   | <--- If there are more than 1 row with "None" in column `COL_C` in each group, then values in column `COL_A` should be "merged" into one row, and all others should be deleted
| 0           | 1.93       | 1.93   | 0      |
| 0           | 7.60       | 7.60   | 1      |
| 0           | 2.46       | 2.46   | 1      |

The first solution I have is to .loc those rows in the group that have a value "None" in COL_C column and then assign to the first element (.iloc) value from the last row. However, this solution is not quite correct, and I'm additionally convinced that this can be done using .groupby instead of iterating and searching for elements after each group but I can't do it correctly.

The solution I was able to get this way:

 ───────────── ──────────── ──────── ──────── 
| main_group  | COL_A      | COL_B  | COL_C  |
 ───────────── ──────────── ──────── ──────── 
| 0           | TXT1 TXT2  | 5      | None   |
| NaN         | NaN        | NaN    | NaN    |
| NaN         | NaN        | NaN    | NaN    |
| 0           | 1.93       | 1.93   | 0      |
| 0           | 7.60       | 7.60   | 1      |
| 0           | 2.46       | 2.46   | 1      |

This is partially transferred correctly, but there are still NaN values in the rows that should not be there anymore. Of course I can delete those rows and re-index df but this solution relies on loops, which will certainly not be efficient for large df.

How can I do it with a workaround of these loops over individual groups and swapping values using .loc and .iloc?

CodePudding user response:

IIUC, try the following (explanation in comments):

#create indicator column for where COL_C is None
df["indicator"] = df["COL_C"].isnull()

#get the index of the last None value for each main_group
max_null = df.groupby("main_group")["indicator"].transform(lambda x: x.cumsum().idxmax())

#move the COL_A to COL_B for the first index of each group
df["COL_B"] = df["COL_B"].where(df.groupby("main_group").cumcount().ne(0), max_null.map(df["COL_A"]))

# #remove the last rows with None value for each main_group
df = df.drop(max_null.unique()).reset_index(drop=True)

# #concatenate COL_A per main_group
strings = df.groupby("main_group").apply(lambda x: x[x["indicator"]]["COL_A"].str.cat(sep=","))

#assign concatenated strings to COL_A
df["COL_A"] = df["COL_A"].where(~df["indicator"], df["main_group"].map(strings))

#drop duplicates from COL_A per group and drop the indicator column
df = df.drop_duplicates(["main_group","COL_A"]).drop("indicator", axis=1).reset_index(drop=True)

>>> df
    main_group       COL_A COL_B  COL_C
0            0   TXT1,TXT2     5    NaN
1            0        1.93  1.93    0.0
2            0         7.6   7.6    1.0
3            0        2.46  2.46    1.0
4            1  TXT1,TXT12   0.5    NaN
5            1        0.45  0.45    0.0
6            1        0.31  0.31    1.0
7            1        0.35  0.35    1.0
8            1        0.73  0.73    1.0
9            2        4.15  4.15    0.0
10           2        2.98  2.98    0.0
11           2        1.53  1.53    0.0
12           3           4   4.0    0.0
13           3        0.95  0.95    1.0
14           3        1.35  1.35    1.0
15           3        1.79  1.79    1.0

CodePudding user response:

Here's a way to get the final result you've asked for:

print('\nInput df:'); print(df)

df = df.assign(range_index=df.index)
gb = df[df['COL_C'].isna()].groupby(['main_group'])
df2 = pd.concat([
    gb.nth(0)['range_index'], 
    gb.last()['COL_A'].copy().rename('COL_B_update'), 
    gb['COL_A'].apply(list).str.slice(stop=-1).str.join(' ')
    ], axis=1).set_index('range_index')
emptyColA = df2['COL_A'].str.len() == 0
df2.loc[emptyColA, 'COL_A'] = df2.loc[emptyColA, 'COL_B_update']

print('\ndf2:'); print(df2)

df = df.join(df2, on='range_index', rsuffix='_list')

print('\ndf just after join():'); print(df)

df.loc[~df.COL_A_list.isna(), 'COL_A_update'] = df.COL_A_list
df.loc[~df['COL_C'].isna(), 'COL_A_update'] = df.COL_A
df = df.loc[~df.COL_A_update.isna()].drop(columns=['range_index', 'COL_A_list'])

print('\ndf after creating COL_A_update, deleting unwanted rows, and dropping intermediate columns range_index and COL_A_list:'); print(df)

df.loc[df['COL_C'].isna(), 'COL_B'] = df.loc[df['COL_C'].isna(), 'COL_B_update']
df.loc[df['COL_C'].isna(), 'COL_A'] = df.loc[df['COL_C'].isna(), 'COL_A_update']
df = df.drop(columns=['COL_B_update', 'COL_A_update']).rename(columns={'COL_A_update':'COL_A'}).reset_index(drop=True)

print('\nOutput df after updating COL_A and COL_B, and dropping intermediate columns COL_A_update and COL_B_update:'); print(df)

Explanation:

  • duplicate the index in a new column range_index
  • create a groupby object on main_group for rows with COL_C of None
  • with the groupby object:
    • use nth(0) to get the range_index value for the first row in each group of contiguous None values in COL_C
    • use last() and Series.rename() to create column COL_B_update with needed values copied from COL_A
    • use apply(list), as well as slice() and join() on Series.str (a sequence accessor which, confusingly, acts here on a list rather than a str) to convert all but one of each group's COL_A values into a space separated string of such values
    • concat these three Series into df2 with range_index as the index
  • use join to add to df the new columns COL_B_update and a COL_A_list (a renaming of COL_A from df2)
  • create a new column COL_A_update containing COL_A list strings for rows that have them, and containing COL_A values for rows where COL_C is not None
  • remove all other rows (i.e., remove all rows but the first in each contiguous block of rows where COL_C is None), and drop intermediate columns range_index and COL_A_list
  • update rows where COL_C is None in COL_B and COL_A using COL_B_update and COL_A_update, and drop the intermediate columns COL_A_update and COL_B_update.

Input:

   main_group  COL_A COL_B COL_C
0           0   TXT1        None
1           0   TXT2        None
2           0      5        None
3           0   1.93  1.93     0
4           0   7.60  7.60     1
5           0   2.46  2.46     1
6           1  TXT11        None
7           1  TXT12        None
8           1   0.50        None
9           1   0.45  0.45     0
10          1   0.31  0.31     1
11          1   0.35  0.35     1
12          1   0.73  0.73     1
13          2    0.5        None
14          2   4.15  4.15     0
15          2   2.98  2.98     0
16          2   1.53  1.53     0
17          3   4.46        None
18          3   4.00  4.00     0
19          3   0.95  0.95     1
20          3   1.35  1.35     1
21          3   1.79  1.79     1

Here is df2 right before the join:

            COL_B_update        COL_A
range_index
0                      5    TXT1 TXT2
6                   0.50  TXT11 TXT12
13                   0.5          0.5
17                  4.46         4.46

Here is df right after the join, with new columns range_index, COL_B_update and COL_A_list:

   main_group  COL_A COL_B COL_C  range_index COL_B_update   COL_A_list
0           0   TXT1        None            0            5    TXT1 TXT2
1           0   TXT2        None            1          NaN          NaN
2           0      5        None            2          NaN          NaN
3           0   1.93  1.93     0            3          NaN          NaN
4           0   7.60  7.60     1            4          NaN          NaN
5           0   2.46  2.46     1            5          NaN          NaN
6           1  TXT11        None            6         0.50  TXT11 TXT12
7           1  TXT12        None            7          NaN          NaN
8           1   0.50        None            8          NaN          NaN
9           1   0.45  0.45     0            9          NaN          NaN
10          1   0.31  0.31     1           10          NaN          NaN
11          1   0.35  0.35     1           11          NaN          NaN
12          1   0.73  0.73     1           12          NaN          NaN
13          2    0.5        None           13          0.5          0.5
14          2   4.15  4.15     0           14          NaN          NaN
15          2   2.98  2.98     0           15          NaN          NaN
16          2   1.53  1.53     0           16          NaN          NaN
17          3   4.46        None           17         4.46         4.46
18          3   4.00  4.00     0           18          NaN          NaN
19          3   0.95  0.95     1           19          NaN          NaN
20          3   1.35  1.35     1           20          NaN          NaN
21          3   1.79  1.79     1           21          NaN          NaN

Here is df after creating COL_A_update, deleting unwanted rows, and dropping intermediate columns range_index and COL_A_list:

   main_group  COL_A COL_B COL_C COL_B_update COL_A_update
0           0   TXT1        None            5    TXT1 TXT2
3           0   1.93  1.93     0          NaN         1.93
4           0   7.60  7.60     1          NaN         7.60
5           0   2.46  2.46     1          NaN         2.46
6           1  TXT11        None         0.50  TXT11 TXT12
9           1   0.45  0.45     0          NaN         0.45
10          1   0.31  0.31     1          NaN         0.31
11          1   0.35  0.35     1          NaN         0.35
12          1   0.73  0.73     1          NaN         0.73
13          2    0.5        None          0.5          0.5
14          2   4.15  4.15     0          NaN         4.15
15          2   2.98  2.98     0          NaN         2.98
16          2   1.53  1.53     0          NaN         1.53
17          3   4.46        None         4.46         4.46
18          3   4.00  4.00     0          NaN         4.00
19          3   0.95  0.95     1          NaN         0.95
20          3   1.35  1.35     1          NaN         1.35
21          3   1.79  1.79     1          NaN         1.79

Output after updating COL_A and COL_B, and dropping intermediate columns COL_A_update and COL_B_update:

   main_group        COL_A COL_B COL_C
0           0    TXT1 TXT2     5  None
1           0         1.93  1.93     0
2           0         7.60  7.60     1
3           0         2.46  2.46     1
4           1  TXT11 TXT12  0.50  None
5           1         0.45  0.45     0
6           1         0.31  0.31     1
7           1         0.35  0.35     1
8           1         0.73  0.73     1
9           2          0.5   0.5  None
10          2         4.15  4.15     0
11          2         2.98  2.98     0
12          2         1.53  1.53     0
13          3         4.46  4.46  None
14          3         4.00  4.00     0
15          3         0.95  0.95     1
16          3         1.35  1.35     1
17          3         1.79  1.79     1
  • Related