I'm rewriting the question to make the DataFrame I'm having trouble with clearer to everyone. The DataFrame contains over 100,000 rows and more than 10 columns. The main issue is filling in a missing value in the id column with the id of another row that has the same value in the type column. The values in the id column are not unique. Sometimes both the id and type values are NIL.
Note: The order of the rows must not be affected, as the records are dated by other columns. There are also no empty cells; all empty cells are filled with "NIL".
GOAL: To replace as many "NIL" values as possible in the id column only. I do not care about any "NIL" values in the type column. Everything else must remain the same.
I am assuming this can be done without loops. Thank you for your help! Really appreciate it, strangers!
Current DataFrame
COL1 | id | type | COL4 | COL5 |
---|---|---|---|---|
NIL | 123 | moving | NIL | NIL |
... | 123 | lend | ... | ... |
NIL | 456 | penny | ... | NIL |
... | NIL | appendix | NIL | ... |
... | 251 | appendix | ... | ... |
NIL | NIL | lend | ... | NIL |
... | 665 | NIL | NIL | ... |
... | NIL | NIL | ... | ... |
... | 251 | retailer | NIL | NIL |
NIL | 251 | appendix | ... | NIL |
NIL | 456 | NIL | ... | NIL |
Expected DataFrame
COL1 | id | type | COL4 | COL5 |
---|---|---|---|---|
NIL | 123 | moving | NIL | NIL |
... | 123 | lend | ... | ... |
NIL | 456 | penny | ... | NIL |
... | 251 | appendix | NIL | ... |
... | 251 | appendix | ... | ... |
NIL | 123 | lend | ... | NIL |
... | 665 | NIL | NIL | ... |
... | NIL | NIL | ... | ... |
... | 251 | retailer | NIL | NIL |
NIL | 251 | appendix | ... | NIL |
NIL | 456 | NIL | ... | NIL |
CodePudding user response:
You can use a merge
# assemble the dataframe (assumes `import pandas` has already been run)
In [4]: a = [26235, 'fruit', 26235, 'veggie', 32256, 'NIL', 'NIL', 'pawn', 12415, 'NIL', 'NIL', 'fruit', 'NIL', 'veggie', 24145, 'pawn', 'NIL', 'NIL']
In [5]: df = pandas.DataFrame({'id':a[::2], 'type':a[1::2]})
# id type
#0 26235 fruit
#1 26235 veggie
#2 32256 NIL
#3 NIL pawn
#4 12415 NIL
#5 NIL fruit
#6 NIL veggie
#7 24145 pawn
#8 NIL NIL
In [6]: id_is_nill = df.id=="NIL"
In [7]: left = df.loc[id_is_nill].reset_index()
In [8]: right = df.loc[~id_is_nill].reset_index()
In [9]: mrg = pandas.merge(left, right, on='type').query("type != 'NIL'")
In [10]: mrg
Out[10]:
index_x id_x type index_y id_y
0 3 NIL pawn 7 24145
1 5 NIL fruit 0 26235
2 6 NIL veggie 1 26235
In [11]: df.loc[mrg.index_x, 'id'] = mrg.id_y.values
In [12]: df
Out[12]:
id type
0 26235 fruit
1 26235 veggie
2 32256 NIL
3 24145 pawn
4 12415 NIL
5 26235 fruit
6 26235 veggie
7 24145 pawn
8 NIL NIL
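For anyone who wants to run this outside an IPython session, here is a minimal sketch of the same merge idea as a plain script. It assumes pandas is imported as pd. It also adds one step that is not in the session above: drop_duplicates on the lookup side, so a type with several known ids (which the question's data has) contributes only its first id. Treat that "first id wins" rule as an assumption, since the question does not say which id to prefer.

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [26235, 26235, 32256, "NIL", 12415, "NIL", "NIL", 24145, "NIL"],
    "type": ["fruit", "veggie", "NIL", "pawn", "NIL", "fruit", "veggie", "pawn", "NIL"],
})

id_is_nil = df["id"].eq("NIL")
type_known = df["type"].ne("NIL")

# Rows that need an id; reset_index() keeps the original row label in "index".
left = df.loc[id_is_nil & type_known].reset_index()

# One known id per type (assumption: the first one seen wins),
# so the merge cannot produce several candidate ids for one row.
right = df.loc[~id_is_nil & type_known, ["type", "id"]].drop_duplicates(subset="type")

# Inner merge: only rows whose type has a known id survive.
mrg = left.merge(right, on="type", suffixes=("_x", "_y"))

# Write the found ids back by original row label; row order is untouched.
df.loc[mrg["index"], "id"] = mrg["id_y"].to_numpy()
print(df)
```

Because the values are written back by row label, the rows never move, which matters for the ordering requirement in the question.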
CodePudding user response:
Here is one way to do it:
create a dictionary that maps each type to a known id, then use loc to fill in the missing values
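d = df[df['id'].ne('NIL') & df['type'].ne('NIL')].drop_duplicates(subset=['type']).set_index('type')['id'].to_dict()
m = df['id'].eq('NIL') & df['type'].isin(list(d))  # only rows whose type has a known id
df.loc[m, 'id'] = df.loc[m, 'type'].map(d)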
df
id type
0 26235 fruit
1 26235 veggie
2 32256 NIL
3 24145 pawn
4 12415 NIL
5 26235 fruit
6 26235 veggie
7 24145 pawn
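A self-contained sketch of the dictionary idea follows, in case it helps to run it end to end. It assumes pandas is imported as pd and reuses the sample data from the merge answer; the names type_to_id and fillable are only illustrative. The dictionary has to be keyed by type with the id as the value, so the sketch uses set_index('type')['id'] before calling to_dict(). It also restricts the assignment to rows whose type is actually in the dictionary, so rows that cannot be filled keep "NIL" rather than turning into NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [26235, 26235, 32256, "NIL", 12415, "NIL", "NIL", 24145, "NIL"],
    "type": ["fruit", "veggie", "NIL", "pawn", "NIL", "fruit", "veggie", "pawn", "NIL"],
})

# Rows where both id and type are known.
known = df[df["id"].ne("NIL") & df["type"].ne("NIL")]

# type -> id lookup; drop_duplicates keeps the first id seen for each type.
type_to_id = known.drop_duplicates(subset="type").set_index("type")["id"].to_dict()

# Only touch rows that lack an id AND whose type appears in the lookup.
fillable = df["id"].eq("NIL") & df["type"].isin(list(type_to_id))
df.loc[fillable, "id"] = df.loc[fillable, "type"].map(type_to_id)
print(df)
```

The last row, where both id and type are "NIL", is left as it is.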
CodePudding user response:
You can do:
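# Build a type -> id lookup from rows where both values are known.
# Equivalent filter: df.loc[(df.id != 'NIL') & (df.type != 'NIL')]
# drop_duplicates keeps one id per type, since map needs a unique index.
d = df[df != 'NIL'].dropna().drop_duplicates('type').set_index('type')['id']
mask = df.id == 'NIL'
df.loc[mask, 'id'] = df.loc[mask, 'type'].map(d).fillna('NIL')  # unmatched rows stay 'NIL'
print(df)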
id type
0 26235 fruit
1 26235 veggie
2 32256 NIL
3 24145 pawn
4 12415 NIL
5 26235 fruit
6 26235 veggie
7 24145 pawn
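To check the approach against the question's own table, here is a small sketch that applies the same lookup to the id and type columns shown there. The function name fill_ids and its missing parameter are hypothetical, and the ids are assumed to be integers with the string "NIL" as the placeholder. The other columns are left out because their values are elided in the question. Note that appendix appears several times with a known id, which is why the lookup keeps only the first id per type.

```python
import pandas as pd

def fill_ids(df: pd.DataFrame, missing: str = "NIL") -> pd.DataFrame:
    """Return a copy of df with missing ids filled from the first known id of the same type."""
    out = df.copy()
    known = out[out["id"].ne(missing) & out["type"].ne(missing)]
    # Unique type -> id lookup (first occurrence wins).
    lookup = known.drop_duplicates(subset="type").set_index("type")["id"]
    # Rows that lack an id but whose type has a known id.
    fillable = out["id"].eq(missing) & out["type"].isin(lookup.index)
    out.loc[fillable, "id"] = out.loc[fillable, "type"].map(lookup)
    return out

question = pd.DataFrame({
    "id":   [123, 123, 456, "NIL", 251, "NIL", 665, "NIL", 251, 251, 456],
    "type": ["moving", "lend", "penny", "appendix", "appendix", "lend",
             "NIL", "NIL", "retailer", "appendix", "NIL"],
})

result = fill_ids(question)
print(result)
```

Under those assumptions the id column comes out as in the Expected DataFrame: the appendix row gets 251, the lend row gets 123, and the row where both values are "NIL" stays unchanged.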