I have a sample data set which is similar to the one defined below.
import pandas as pd

dict_1 = {'Id' : [1, 1, 2, 2, 3, 4],
          'boolean_val' : [True, False, True, False, True, False],
          "sal" : [1000, 2000, 1500, 2500, 3500, 4500]}
test = pd.DataFrame(dict_1)
test.head(10)
I have to create two new columns in the test dataframe, output_True and output_False, based on these conditions:
a) If Id[0] == Id[1] and boolean_val == True, put sal[0] (the value where boolean_val == True) in output_True, else "NA".
b) If Id[0] == Id[1] and boolean_val == False, put sal[1] (the value where boolean_val == False) in output_False, else "NA".
c) If Id[0] != Id[1] and boolean_val == True, put that row's sal value in output_True; if Id[0] != Id[1] and boolean_val == False, put that row's sal value in output_False.
If my question is not clearly framed, please see the dataframe below; I want my output to match the output_True and output_False columns shown there.
dict_1 = {'Id' : [1, 1, 2, 2, 3, 4],
'boolean_val' : [True, False, True, False, True, False],
"sal" : [1000, 2000, 1500, 2500, 3500, 4500],
"output_True" : [1000, "NA", 1500, "NA", 3500, "NA"],
"output_False" : [2000, "NA", 2500, "NA", "NA", 4500]}
output_df = pd.DataFrame(dict_1)
output_df.head(10)
I have tried using np.where() and a list comprehension, but my output does not show the correct values. Can someone please help me with this?
CodePudding user response:
Use loc to assign your values for the boolean column. For the second condition you can use .shift() to compare your Id[0] == Id[1] values and sum based on that:
import numpy as np
import pandas as pd

dict_1 = {'Id' : [1, 1, 2, 2, 3, 4],
'boolean_val' : [True, False, True, False, True, False],
"sal" : [1000, 2000, 1500, 2500, 3500, 4500]}
test = pd.DataFrame(dict_1)
test
   Id  boolean_val   sal
0   1         True  1000
1   1        False  2000
2   2         True  1500
3   2        False  2500
4   3         True  3500
5   4        False  4500
cond1 = test.boolean_val
test.loc[cond1, 'output_True'] = test.sal
cond2 = (test.Id.shift(-1).eq(test.Id))
test['output_False'] = np.nan
test.loc[cond2, 'output_False'] = test['sal'] + test['output_True']
test
   Id  boolean_val   sal  output_True  output_False
0   1         True  1000       1000.0        2000.0
1   1        False  2000          NaN           NaN
2   2         True  1500       1500.0        3000.0
3   2        False  2500          NaN           NaN
4   3         True  3500       3500.0           NaN
5   4        False  4500          NaN           NaN
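Note that the output_False column above does not match the question's desired output for Id 2 (3000 instead of 2500) or Id 4 (NaN instead of 4500). The following is only a sketch of a shift-based variant that reproduces the desired values for this sample. It assumes each Id has at most one True row immediately followed by one False row; the helper names (is_true, next_same_id, prev_same_id, paired, lone_false) are made up for illustration.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'Id': [1, 1, 2, 2, 3, 4],
    'boolean_val': [True, False, True, False, True, False],
    'sal': [1000, 2000, 1500, 2500, 3500, 4500],
})

is_true = test['boolean_val']
# Does the next / previous row carry the same Id as this row?
next_same_id = test['Id'].shift(-1).eq(test['Id'])
prev_same_id = test['Id'].shift(1).eq(test['Id'])

# output_True: the row's own sal wherever boolean_val is True, NaN elsewhere
test['output_True'] = test['sal'].where(is_true)

test['output_False'] = np.nan
# True row followed by a row with the same Id: take the next row's sal
paired = is_true & next_same_id
test.loc[paired, 'output_False'] = test['sal'].shift(-1)
# False row with no preceding row of the same Id: take its own sal
lone_false = ~is_true & ~prev_same_id
test.loc[lone_false, 'output_False'] = test['sal']

print(test)
```

Missing values are left as NaN here; replace them with the string "NA" afterwards if that exact representation is required.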
CodePudding user response:
Here's a way to get your desired output:
df = test.pivot(index='Id', columns='boolean_val', values='sal')
df = df.assign(boolean_val=df.loc[:,True].notna()).set_index('boolean_val', append=True)
df = df.rename(columns={True:'output_True', False:'output_False'})[['output_True', 'output_False']]
output_df = test.join(df, on=['Id','boolean_val'])
for col in ('output_True', 'output_False'):
output_df[col] = np.where(output_df[col].isna(), "NA", output_df[col].astype(pd.Int64Dtype()))
Output:
   Id  boolean_val   sal output_False output_True
0   1         True  1000         2000        1000
1   1        False  2000           NA          NA
2   2         True  1500         2500        1500
3   2        False  2500           NA          NA
4   3         True  3500           NA        3500
5   4        False  4500         4500          NA
Explanation:
- use pivot() to create an intermediate dataframe df with True and False columns containing the corresponding sal values for each Id
- add a boolean_val column which contains True unless a given row's True column is NaN
- set Id, boolean_val as the index for df
- rename the True and False columns as output_True and output_False and swap their positions (to match the desired output)
- use join() to create output_df, which is test with added columns output_True and output_False
- replace NaN with the string "NA" and change sal values from float to int in output_True and output_False.
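The pivot-then-join idea in the steps above can also be sketched in a slightly different form. This is a variant, not the code from the answer: it maps boolean_val to the target column names before pivoting (so no boolean column labels are involved) and keeps the values only on the first row of each Id instead of joining on boolean_val. The names wide, out, first_of_id and target are hypothetical, and the sketch assumes each (Id, boolean_val) pair occurs at most once, which pivot() requires.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'Id': [1, 1, 2, 2, 3, 4],
    'boolean_val': [True, False, True, False, True, False],
    'sal': [1000, 2000, 1500, 2500, 3500, 4500],
})

# One row per Id, one column per boolean value, holding the matching sal
wide = (
    test.assign(target=test['boolean_val'].map({True: 'output_True',
                                                False: 'output_False'}))
        .pivot(index='Id', columns='target', values='sal')
)

# Attach both columns to every row of the same Id
out = test.join(wide[['output_True', 'output_False']], on='Id')

# Keep the values only on the first row of each Id, blank out the rest
first_of_id = ~out['Id'].duplicated()
out.loc[~first_of_id, ['output_True', 'output_False']] = np.nan

print(out)
```

For the sample data this gives the same non-missing values as the desired output, with NaN in place of the string "NA".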