I have an example `df`:
───────────── ──────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────── ──────── ────────
| 0 | TXT1 | | None |
| 0 | TXT2 | | None |
| 0 | 5 | | None |
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
| 1 | TXT11 | | None |
| 1 | TXT12 | | None |
| 1 | 0.50 | | None |
| 1 | 0.45 | 0.45 | 0 |
| 1 | 0.31 | 0.31 | 1 |
| 1 | 0.35 | 0.35 | 1 |
| 1 | 0.73 | 0.73 | 1 |
| 2 | 0.5 | | None |
| 2 | 4.15 | 4.15 | 0 |
| 2 | 2.98 | 2.98 | 0 |
| 2 | 1.53 | 1.53 | 0 |
| 3 | 4.46 | | None |
| 3 | 4.00 | 4.00 | 0 |
| 3 | 0.95 | 0.95 | 1 |
| 3 | 1.35 | 1.35 | 1 |
| 3 | 1.79 | 1.79 | 1 |
───────────── ──────── ──────── ────────
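For completeness, the frame can be reconstructed roughly like this (the dtypes are a guess; I assume the blank `COL_B` cells are NaN and "None" in `COL_C` is a missing value):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "main_group": [0]*6 + [1]*7 + [2]*4 + [3]*5,
    "COL_A": ["TXT1", "TXT2", 5, 1.93, 7.60, 2.46,
              "TXT11", "TXT12", 0.50, 0.45, 0.31, 0.35, 0.73,
              0.5, 4.15, 2.98, 1.53,
              4.46, 4.00, 0.95, 1.35, 1.79],
    "COL_B": [np.nan, np.nan, np.nan, 1.93, 7.60, 2.46,
              np.nan, np.nan, np.nan, 0.45, 0.31, 0.35, 0.73,
              np.nan, 4.15, 2.98, 1.53,
              np.nan, 4.00, 0.95, 1.35, 1.79],
    "COL_C": [None, None, None, 0, 1, 1,
              None, None, None, 0, 1, 1, 1,
              None, 0, 0, 0,
              None, 0, 1, 1, 1],
})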
Per `main_group`, I would like to take the value from `COL_A` in the last row that has None in column `COL_C`. This value should be moved to the `COL_B` column of the first row in the group, and the source row should then be deleted. Here is how it should look for `main_group == 0`:
───────────── ──────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────── ──────── ────────
| 0 | TXT1 | 5 | None | <--- value "5" from the last row with "None" in `COL_C` in `main_group` == 0 was moved to the first row in the same group
| 0 | TXT2 | | None |
| 0 | 5 | | None | <--- After that this row should be deleted
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
───────────── ──────── ──────── ────────
And here for `main_group == 2` and `main_group == 3`, which have only one row with None in `COL_C`:
───────────── ──────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────── ──────── ────────
| 2 | 0.5 | 0.5 | None | <--- the value in column `COL_B` should be the same as in column `COL_A` because there are no other rows in the same `main_group` with "None" in column `COL_C`
| 2 | 4.15 | 4.15 | 0 |
| 2 | 2.98 | 2.98 | 0 |
| 2 | 1.53 | 1.53 | 0 |
| 3 | 4.46 | 4.46 | None | <--- the value in column `COL_B` should be the same as in column `COL_A` because there are no other rows in the same `main_group` with "None" in column `COL_C`
| 3 | 4.00 | 4.00 | 0 |
| 3 | 0.95 | 0.95 | 1 |
| 3 | 1.35 | 1.35 | 1 |
| 3 | 1.79 | 1.79 | 1 |
───────────── ──────── ──────── ────────
After this operation, `df` should look like this:
───────────── ──────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────── ──────── ────────
| 0 | TXT1 | 5 | None |
| 0 | TXT2 | | None |
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
| 1 | TXT11 | 0.50 | None |
| 1 | TXT12 | | None |
| 1 | 0.45 | 0.45 | 0 |
| 1 | 0.31 | 0.31 | 1 |
| 1 | 0.35 | 0.35 | 1 |
| 1 | 0.73 | 0.73 | 1 |
| 2 | 0.5 | 0.5 | None |
| 2 | 4.15 | 4.15 | 0 |
| 2 | 2.98 | 2.98 | 0 |
| 2 | 1.53 | 1.53 | 0 |
| 3 | 4.46 | 4.46 | None |
| 3 | 4.00 | 4.00 | 0 |
| 3 | 0.95 | 0.95 | 1 |
| 3 | 1.35 | 1.35 | 1 |
| 3 | 1.79 | 1.79 | 1 |
───────────── ──────── ──────── ────────
In the last step, I would like to concatenate the selected `COL_A` values in each `main_group` where the value in `COL_C` is None. Example:
───────────── ──────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────── ──────── ────────
| 0 | TXT1 | 5 | None |
| 0 | TXT2 | | None |
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
↓↓↓↓↓↓↓↓↓↓
───────────── ──────────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────────── ──────── ────────
| 0 | TXT1 TXT2 | 5 | None | <--- If there is more than one row with "None" in column `COL_C` in a group, then the values in column `COL_A` should be "merged" into one row, and all the others should be deleted
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
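To make the rule concrete, within a single group the merge I mean is essentially this (only an illustration against the intermediate frame shown above; `group0` is a throwaway name):

group0 = df[df["main_group"] == 0]
none_mask = group0["COL_C"].isna()
# join COL_A of all rows where COL_C is None into one space-separated string
merged = " ".join(str(v) for v in group0.loc[none_mask, "COL_A"])
# merged == "TXT1 TXT2"; it should stay in the first of those rows, and the remaining None rows are dropped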
The first solution I have is to select, with `.loc`, the rows in each group that have None in the `COL_C` column and then assign the value from the last such row to the first element (via `.iloc`).
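A sketch of the kind of per-group loop I mean (not my exact code; `out` is just a working copy):

out = df.copy()
for _, group in df.groupby("main_group"):
    # rows of this group where COL_C is None
    none_rows = group.loc[group["COL_C"].isna()]
    if none_rows.empty:
        continue
    # COL_A of the last "None" row goes into COL_B of the first row of the group
    out.loc[group.index[0], "COL_B"] = none_rows["COL_A"].iloc[-1]
    # if that was not already the first row, drop the source row afterwards
    if len(none_rows) > 1:
        out = out.drop(none_rows.index[-1])
out = out.reset_index(drop=True)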
However, this solution is not quite correct, and I'm also convinced that this can be done with `.groupby` instead of iterating over and searching through each group, but I can't get it right. The result I was able to get this way:
───────────── ──────────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────────── ──────── ────────
| 0 | TXT1 TXT2 | 5 | None |
| NaN | NaN | NaN | NaN |
| NaN | NaN | NaN | NaN |
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
This is partially transferred correctly, but there are still `NaN` values in rows that should no longer be there. Of course I could delete those rows and re-index `df`, but that relies on loops, which will certainly not be efficient for a large `df`. How can I do this while avoiding loops over individual groups and swapping values with `.loc` and `.iloc`?
CodePudding user response:
IIUC, try the following (explanation in comments):
#create indicator column for where COL_C is None
df["indicator"] = df["COL_C"].isnull()
#get the index of the last None value for each main_group
max_null = df.groupby("main_group")["indicator"].transform(lambda x: x.cumsum().idxmax())
#move the COL_A to COL_B for the first index of each group
df["COL_B"] = df["COL_B"].where(df.groupby("main_group").cumcount().ne(0), max_null.map(df["COL_A"]))
#remove the last row with a None value for each main_group
df = df.drop(max_null.unique()).reset_index(drop=True)
#concatenate COL_A per main_group
strings = df.groupby("main_group").apply(lambda x: x[x["indicator"]]["COL_A"].str.cat(sep=","))
#assign concatenated strings to COL_A
df["COL_A"] = df["COL_A"].where(~df["indicator"], df["main_group"].map(strings))
#drop duplicates from COL_A per group and drop the indicator column
df = df.drop_duplicates(["main_group","COL_A"]).drop("indicator", axis=1).reset_index(drop=True)
>>> df
main_group COL_A COL_B COL_C
0 0 TXT1,TXT2 5 NaN
1 0 1.93 1.93 0.0
2 0 7.6 7.6 1.0
3 0 2.46 2.46 1.0
4 1 TXT11,TXT12 0.5 NaN
5 1 0.45 0.45 0.0
6 1 0.31 0.31 1.0
7 1 0.35 0.35 1.0
8 1 0.73 0.73 1.0
9 2 4.15 4.15 0.0
10 2 2.98 2.98 0.0
11 2 1.53 1.53 0.0
12 3 4 4.0 0.0
13 3 0.95 0.95 1.0
14 3 1.35 1.35 1.0
15 3 1.79 1.79 1.0
CodePudding user response:
Here's a way to get the final result you've asked for:
print('\nInput df:'); print(df)
df = df.assign(range_index=df.index)
gb = df[df['COL_C'].isna()].groupby(['main_group'])
df2 = pd.concat([
gb.nth(0)['range_index'],
gb.last()['COL_A'].copy().rename('COL_B_update'),
gb['COL_A'].apply(list).str.slice(stop=-1).str.join(' ')
], axis=1).set_index('range_index')
emptyColA = df2['COL_A'].str.len() == 0
df2.loc[emptyColA, 'COL_A'] = df2.loc[emptyColA, 'COL_B_update']
print('\ndf2:'); print(df2)
df = df.join(df2, on='range_index', rsuffix='_list')
print('\ndf just after join():'); print(df)
df.loc[~df.COL_A_list.isna(), 'COL_A_update'] = df.COL_A_list
df.loc[~df['COL_C'].isna(), 'COL_A_update'] = df.COL_A
df = df.loc[~df.COL_A_update.isna()].drop(columns=['range_index', 'COL_A_list'])
print('\ndf after creating COL_A_update, deleting unwanted rows, and dropping intermediate columns range_index and COL_A_list:'); print(df)
df.loc[df['COL_C'].isna(), 'COL_B'] = df.loc[df['COL_C'].isna(), 'COL_B_update']
df.loc[df['COL_C'].isna(), 'COL_A'] = df.loc[df['COL_C'].isna(), 'COL_A_update']
df = df.drop(columns=['COL_B_update', 'COL_A_update']).rename(columns={'COL_A_update':'COL_A'}).reset_index(drop=True)
print('\nOutput df after updating COL_A and COL_B, and dropping intermediate columns COL_A_update and COL_B_update:'); print(df)
Explanation:
- duplicate the index in a new column `range_index`
- create a `groupby` object on `main_group` for rows with `COL_C` of None
- with the `groupby` object:
  - use `nth(0)` to get the `range_index` value for the first row in each group of contiguous None values in `COL_C`
  - use `last()` and `Series.rename()` to create column `COL_B_update` with the needed values copied from `COL_A`
  - use `apply(list)`, as well as `slice()` and `join()` on `Series.str` (a sequence accessor which, confusingly, acts here on a `list` rather than a `str`), to convert all but one of each group's `COL_A` values into a space-separated string of such values (see the short sketch after this list)
  - `concat` these three Series into `df2` with `range_index` as the index
- use `join` to add to `df` the new columns `COL_B_update` and `COL_A_list` (a renaming of `COL_A` from `df2`)
- create a new column `COL_A_update` containing the `COL_A` list strings for rows that have them, and the `COL_A` values for rows where `COL_C` is not None
- remove all other rows (i.e., all rows but the first in each contiguous block of rows where `COL_C` is None), and drop the intermediate columns `range_index` and `COL_A_list`
- update `COL_B` and `COL_A` in rows where `COL_C` is None using `COL_B_update` and `COL_A_update`, and drop the intermediate columns `COL_A_update` and `COL_B_update`.
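The `Series.str`-on-lists behaviour is the least obvious part, so here is a tiny standalone illustration of just that step (the sample values are made up):

import pandas as pd

# a Series whose values are Python lists, e.g. the result of groupby(...)['COL_A'].apply(list)
s = pd.Series({0: ['TXT1', 'TXT2', '5'], 2: ['0.5']})

trimmed = s.str.slice(stop=-1)   # slices each list: ['TXT1', 'TXT2'] and []
joined = trimmed.str.join(' ')   # joins each list: 'TXT1 TXT2' and ''
print(joined)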
Input:
main_group COL_A COL_B COL_C
0 0 TXT1 None
1 0 TXT2 None
2 0 5 None
3 0 1.93 1.93 0
4 0 7.60 7.60 1
5 0 2.46 2.46 1
6 1 TXT11 None
7 1 TXT12 None
8 1 0.50 None
9 1 0.45 0.45 0
10 1 0.31 0.31 1
11 1 0.35 0.35 1
12 1 0.73 0.73 1
13 2 0.5 None
14 2 4.15 4.15 0
15 2 2.98 2.98 0
16 2 1.53 1.53 0
17 3 4.46 None
18 3 4.00 4.00 0
19 3 0.95 0.95 1
20 3 1.35 1.35 1
21 3 1.79 1.79 1
Here is `df2` right before the `join`:
COL_B_update COL_A
range_index
0 5 TXT1 TXT2
6 0.50 TXT11 TXT12
13 0.5 0.5
17 4.46 4.46
Here is `df` right after the `join`, with the new columns `range_index`, `COL_B_update` and `COL_A_list`:
main_group COL_A COL_B COL_C range_index COL_B_update COL_A_list
0 0 TXT1 None 0 5 TXT1 TXT2
1 0 TXT2 None 1 NaN NaN
2 0 5 None 2 NaN NaN
3 0 1.93 1.93 0 3 NaN NaN
4 0 7.60 7.60 1 4 NaN NaN
5 0 2.46 2.46 1 5 NaN NaN
6 1 TXT11 None 6 0.50 TXT11 TXT12
7 1 TXT12 None 7 NaN NaN
8 1 0.50 None 8 NaN NaN
9 1 0.45 0.45 0 9 NaN NaN
10 1 0.31 0.31 1 10 NaN NaN
11 1 0.35 0.35 1 11 NaN NaN
12 1 0.73 0.73 1 12 NaN NaN
13 2 0.5 None 13 0.5 0.5
14 2 4.15 4.15 0 14 NaN NaN
15 2 2.98 2.98 0 15 NaN NaN
16 2 1.53 1.53 0 16 NaN NaN
17 3 4.46 None 17 4.46 4.46
18 3 4.00 4.00 0 18 NaN NaN
19 3 0.95 0.95 1 19 NaN NaN
20 3 1.35 1.35 1 20 NaN NaN
21 3 1.79 1.79 1 21 NaN NaN
Here is `df` after creating `COL_A_update`, deleting unwanted rows, and dropping the intermediate columns `range_index` and `COL_A_list`:
main_group COL_A COL_B COL_C COL_B_update COL_A_update
0 0 TXT1 None 5 TXT1 TXT2
3 0 1.93 1.93 0 NaN 1.93
4 0 7.60 7.60 1 NaN 7.60
5 0 2.46 2.46 1 NaN 2.46
6 1 TXT11 None 0.50 TXT11 TXT12
9 1 0.45 0.45 0 NaN 0.45
10 1 0.31 0.31 1 NaN 0.31
11 1 0.35 0.35 1 NaN 0.35
12 1 0.73 0.73 1 NaN 0.73
13 2 0.5 None 0.5 0.5
14 2 4.15 4.15 0 NaN 4.15
15 2 2.98 2.98 0 NaN 2.98
16 2 1.53 1.53 0 NaN 1.53
17 3 4.46 None 4.46 4.46
18 3 4.00 4.00 0 NaN 4.00
19 3 0.95 0.95 1 NaN 0.95
20 3 1.35 1.35 1 NaN 1.35
21 3 1.79 1.79 1 NaN 1.79
Output after updating `COL_A` and `COL_B`, and dropping the intermediate columns `COL_A_update` and `COL_B_update`:
main_group COL_A COL_B COL_C
0 0 TXT1 TXT2 5 None
1 0 1.93 1.93 0
2 0 7.60 7.60 1
3 0 2.46 2.46 1
4 1 TXT11 TXT12 0.50 None
5 1 0.45 0.45 0
6 1 0.31 0.31 1
7 1 0.35 0.35 1
8 1 0.73 0.73 1
9 2 0.5 0.5 None
10 2 4.15 4.15 0
11 2 2.98 2.98 0
12 2 1.53 1.53 0
13 3 4.46 4.46 None
14 3 4.00 4.00 0
15 3 0.95 0.95 1
16 3 1.35 1.35 1
17 3 1.79 1.79 1