Here's my master dataset
Id Data Category Code
1 tey Airport AIR_02
2 fg Hospital HEA_04
3 dffs Airport AIR_01
4 dsfs Hospital HEA_03
5 fdsf Airport AIR_04
Here's the data I wanr to merge
Id Data Category
1 tetyer Airport
2 fgdss Hospital
3 dffsdsa Airport
4 dsfsas Hospital
5 fdsfada Airport
My Expected Output
Id Data Category Code
1 tey Airport AIR_02
2 fg Hospital HEA_04
3 dffs Airport AIR_01
4 dsfs Hospital HEA_03
5 fdsf Airport AIR_04
6 tetyer Airport AIR_03
7 fgdss Hospital HEA_01
8 dffsdsa Airport AIR_05
9 dsfsas Hospital HEA_02
10 fdsfada Airport AIR_06
Note:
HEA_01
is not avaliable on existing dataset, Every Hospital Code start with HEA_
and Every airport start with AIR_
, code 01
,02
etc is by avaliability
CodePudding user response:
Use:
#split Code by _
df1[['a','b']] = df1['Code'].str.split('_', expand=True)
#converting values to integers
df1['b'] = df1['b'].astype(int)
#aggregate for list and first value for mapping
df11 = df1.groupby(['Category']).agg({'a':'first', 'b':list})
#get difference by np.arange with used values
def f(x):
L = df11['b'][x.name]
a = np.arange(1, len(x) len(L) 1)
#difference with filter same number of values like length of group
return np.setdiff1d(a, L)[:len(x)]
df2['Code'] = df2.groupby('Category')['Category'].transform(f)
#created Code with join
df2['Code'] = df2['Category'].map(df11['a']) '_' df2['Code'].astype(str).str.zfill(2)
print (df2)
Id Data Category Code
0 1 tetyer Airport AIR_03
1 2 fgdss Hospital HEA_01
2 3 dffsdsa Airport AIR_05
3 4 dsfsas Hospital HEA_02
4 5 fdsfada Airport AIR_06
df = pd.concat([df1.drop(['a','b'], 1), df2], ignore_index=True)
print (df)
Id Data Category Code
0 1 tey Airport AIR_02
1 2 fg Hospital HEA_04
2 3 dffs Airport AIR_01
3 4 dsfs Hospital HEA_03
4 5 fdsf Airport AIR_04
5 1 tetyer Airport AIR_03
6 2 fgdss Hospital HEA_01
7 3 dffsdsa Airport AIR_05
8 4 dsfsas Hospital HEA_02
9 5 fdsfada Airport AIR_06
CodePudding user response:
To solve this, I would define a class to act as a code filler. The advantage of this approach is that you can then easily add more data without needing to recompute everything:
class CodeFiller():
def __init__(self, df, col='Code', maps=None):
codes = df[col].str.split('_', expand=True).groupby(0)[1].agg(set).to_dict()
self.maps = maps
self.gens = {prefix: self.code_gen(prefix, codes[prefix]) for prefix in codes}
def code_gen(self, prefix, codes):
from itertools import count
for i in count(1):
num = f'{i:02}'
if num not in codes:
yield f'{prefix}_{num}'
def __call__(self, prefix):
if self.maps:
prefix = self.maps[prefix]
return next(self.gens[prefix])
refs = {'Airport': 'AIR', 'Hospital': 'HEA'}
filler = CodeFiller(df1, maps=refs)
df3 = pd.concat([df1, df2.assign(Code=df2['Category'].map(filler))], ignore_index=True)
output:
Id Data Category Code
0 1 tey Airport AIR_02
1 2 fg Hospital HEA_04
2 3 dffs Airport AIR_01
3 4 dsfs Hospital HEA_03
4 5 fdsf Airport AIR_04
5 1 tetyer Airport AIR_03
6 2 fgdss Hospital HEA_01
7 3 dffsdsa Airport AIR_05
8 4 dsfsas Hospital HEA_02
9 5 fdsfada Airport AIR_06
Now imagine you have more data coming, you can just continue (reusing df2 here for the example):
pd.concat([df3, df2.assign(Code=df2['Category'].map(filler))], ignore_index=True)
output:
Id Data Category Code
[...]
10 1 tetyer Airport AIR_10
11 2 fgdss Hospital HEA_07
12 3 dffsdsa Airport AIR_11
13 4 dsfsas Hospital HEA_08
14 5 fdsfada Airport AIR_12