I am trying to create new column based on the SOURCE
column value for distinct ID
.
There are multiple records for same IDs with different or same Name as there source are different. I am trying to create new column CNAME
to standardize the Name for each ID based on SOURCE
column value.
For each ID
, CNAME
column should be equal to the NAME
where SOURCE
== A
df:
ID NAME SOURCE
1 ABC B
1 ABC C
1 AXY A
2 XYZ B
2 XYZ A
3 SASA D
3 SASA B
3 SSA A
3 SSA C
4 BVA A
4 BA B
5 NAS A
5 VAN B
Output:
ID NAME SOURCE CNAME
1 ABC B AXY
1 ABC C AXY
1 AXY A AXY
2 XYZ B XYZ
2 XYZ A XYZ
3 SASA D SSA
3 SASA B SSA
3 SSA A SSA
3 SSA C SSA
4 BVA A BVA
4 BA B BVA
5 NAS A NAS
5 VAN B NAS
After grouping, i am not able to understand how to fill back value to new column for all rows. I tried to make use of transform
as well but no luck.
df.groupby('ID').apply(lambda x: np.where(x['SOURCE'] == 'A', x['NAME'],' '))
CodePudding user response:
sort_values, groupby ID and broadcast the first SOURCE in each group using transform
df['CNAME'] = df.sort_values(by=['ID','SOURCE']).groupby('ID')['NAME'].transform('first')
Following your clarifying question you can;
s=df.query("SOURCE=='A'")#Filter all the A now that they are not repeated
df['CNAME'] = df['ID'].map(dict(zip(s['ID'],s['NAME'])))#create dict and map
ID NAME SOURCE CNAME
0 1 ABC B AXY
1 1 ABC C AXY
2 1 AXY A AXY
3 2 XYZ B XYZ
4 2 XYZ A XYZ
5 3 SASA D SSA
6 3 SASA B SSA
7 3 SSA A SSA
8 3 SSA C SSA
9 4 BVA A BVA
10 4 BA B BVA
11 5 NAS A NAS
12 5 VAN B NAS
CodePudding user response:
This is a little "sql-ish" (creating a lookup table if you will, then using it in a join or merge operation) but also works:
# get the list of CNAME ids
ids = df[df.SOURCE == 'A']
# join/merge the two dataframes
new_df = df.merge(ids, on='ID', how='left')
# capture the new columns from the joined dataframe
new_df = new_df[['ID', 'NAME_x', 'SOURCE_x', 'NAME_y']]
# rename the columns
new_df.columns = ['ID', 'NAME', 'SOURCE', 'CNAME']
CodePudding user response:
This is not the proper way of doing it... but it works when you want to select a custom item, like SOURCE B
in this case
mapper = df[df['SOURCE']=='B'].set_index('ID')['NAME']
mapper.name = 'NEWID'
df.merge(mapper, on='ID')
ID NAME SOURCE NEWID
0 1 ABC B ABC
1 1 ABC C ABC
2 1 AXY A ABC
3 2 XYZ B XYZ
4 2 XYZ A XYZ
5 3 SASA D SASA
6 3 SASA B SASA
7 3 SSA A SASA
8 3 SSA C SASA
9 4 BVA A BA
10 4 BA B BA
11 5 NAS A VAN
12 5 VAN B VAN
CodePudding user response:
You can try this:
df = df.sort_values(by=['ID','SOURCE'])
df.loc[df['SOURCE'] == 'A','CNAME'] = df.NAME
df.CNAME = df['CNAME'].ffill()
df
index | ID | NAME | SOURCE | CNAME |
---|---|---|---|---|
2 | 1 | AXY | A | AXY |
0 | 1 | ABC | B | AXY |
1 | 1 | ABC | C | AXY |
4 | 2 | XYZ | A | XYZ |
3 | 2 | XYZ | B | XYZ |
7 | 3 | SSA | A | SSA |
6 | 3 | SASA | B | SSA |
8 | 3 | SSA | C | SSA |
5 | 3 | SASA | D | SSA |
9 | 4 | BVA | A | BVA |
10 | 4 | BA | B | BVA |
11 | 5 | NAS | A | NAS |
12 | 5 | VAN | B | NAS |