I want to compare two pandas list columns and compare their lengths and if one column has greater length than the other, I want to pad it with None
value. Is there a pandas function that can do that?
Example, if I have this df below:
df = pd.DataFrame(
{
"address": [['123 Road', '456 Road'], ['789 Road'], ['962 Road', '875 Road', '777 Road']],
"house_id": [['123', '456'], ['789', '987'], ['962']]
}
)
After applying said function, I want my resulting df to look like:
result_df = pd.DataFrame(
{
"address": [['123 Road', '456 Road'], ['789 Road', None], ['962 Road', '875 Road', '777 Road']],
"house_id": [['123', '456'], ['789', '987'], ['962', None, None]]
}
)
Any missing element from either column gets padded with a None value. How can I do this?
Please advise.
CodePudding user response:
One can use pandas.DataFrame.apply
with a custom lambda function as follows
df['house_id'] = df.apply(lambda x: x['house_id'] [None] * (len(x['address']) - len(x['house_id'])), axis=1)
[Out]:
address house_id
0 [123 Road, 456 Road] [123, 456]
1 [789 Road] [789, 987]
2 [962 Road, 875 Road, 777 Road] [962, None, None]
Then for the column address
one can use a similar logic
df['address'] = df.apply(lambda x: x['address'] [None] * (len(x['house_id']) - len(x['address'])), axis=1)
[Out]:
address house_id
0 [123 Road, 456 Road] [123, 456]
1 [789 Road, None] [789, 987]
2 [962 Road, 875 Road, 777 Road] [962, None, None]
Notes:
- Basically what the lambda function does, for the first case, is to add a
None
value to the columnhouse_id
until the length of thehouse_id
column is equal to the length of theaddress
column.
CodePudding user response:
For improve performance avoid DataFrame.apply
, here is solution with list comprehension and custom function, also is possible working with multiple columns defined in list cols
:
cols = ['address','house_id']
def f(x):
max_len = len(max(x, key=len))
return [y [None] * (max_len - len(y)) for y in x]
df[cols] = [f(x) for x in df[cols].to_numpy()]
print (df)
address house_id
0 [123 Road, 456 Road] [123, 456]
1 [789 Road, None] [789, 987]
2 [962 Road, 875 Road, 777 Road] [962, None, None]
df = pd.DataFrame(
{
"address": [['123 Road', '456 Road'], ['789 Road'],
['962 Road', '875 Road', '777 Road']],
"house_id": [['123', '456'], ['789', '987'], ['962']],
"house_id1": [['456'], ['789', '987'], []]
}
)
cols = ['address','house_id', 'house_id1']
def f(x):
max_len = len(max(x, key=len))
return [y [None] * (max_len - len(y)) for y in x]
df[cols] = [f(x) for x in df[cols].to_numpy()]
print (df)
address house_id house_id1
0 [123 Road, 456 Road] [123, 456] [456, None]
1 [789 Road, None] [789, 987] [789, 987]
2 [962 Road, 875 Road, 777 Road] [962, None, None] [None, None, None]
CodePudding user response:
A somewhat simpler, even though less elegant way is this:
import pandas as pd
df = pd.DataFrame(
{
"address": [['123 Road', '456 Road'], ['789 Road'], ['962 Road', '875 Road', '777 Road']],
"house_id": [['123', '456'], ['789', '987'], ['962']]
}
)
for index in range(len(df)):
len_diff=abs(len(df.loc[index,"address"])-len(df.loc[index,"house_id"]))
if len_diff!=0:
if len(df.loc[index,"address"])<len(df.loc[index,"house_id"]):
df.loc[index,"address"]=df.loc[index,"address"] ["None"]*len_diff
else:
df.loc[index,"house_id"]=df.loc[index,"house_id"] ["None"]*len_diff
print(df)
result_df
address house_id
0 [123 Road, 456 Road] [123, 456]
1 [789 Road, None] [789, 987]
2 [962 Road, 875 Road, 777 Road] [962, None, None]
CodePudding user response:
What about something like this?
dft = df.T # transpose
for col in dft.columns:
dft[col] = dft[col].apply(pd.Series).values.tolist()
result_df = dft.T # transpose back
Basically, the .apply(pd.Series) makes multiple columns out of one list column, padding with nan
's as necessary. Since we want to pad row-wise, the transpose is necessary.