Home > front end >  How to generate label by sparse cumcount
How to generate label by sparse cumcount

Time:04-06

Here's my master dataset

Id     Data          Category      Code
1      tey           Airport       AIR_02
2      fg            Hospital      HEA_04
3      dffs          Airport       AIR_01
4      dsfs          Hospital      HEA_03      
5      fdsf          Airport       AIR_04

Here's the data I wanr to merge

Id     Data             Category
1      tetyer           Airport
2      fgdss            Hospital
3      dffsdsa          Airport
4      dsfsas           Hospital      
5      fdsfada          Airport

My Expected Output

Id     Data          Category      Code
1      tey           Airport       AIR_02
2      fg            Hospital      HEA_04
3      dffs          Airport       AIR_01
4      dsfs          Hospital      HEA_03      
5      fdsf          Airport       AIR_04
6      tetyer        Airport       AIR_03
7      fgdss         Hospital      HEA_01
8      dffsdsa       Airport       AIR_05
9      dsfsas        Hospital      HEA_02
10      fdsfada      Airport       AIR_06

Note: HEA_01is not avaliable on existing dataset, Every Hospital Code start with HEA_ and Every airport start with AIR_, code 01,02 etc is by avaliability

CodePudding user response:

Use:

#split Code by _
df1[['a','b']] = df1['Code'].str.split('_', expand=True)
#converting values to integers
df1['b'] = df1['b'].astype(int)
#aggregate for list and first value for mapping
df11 = df1.groupby(['Category']).agg({'a':'first', 'b':list})

#get difference by np.arange with used values
def f(x):
    L = df11['b'][x.name]
    a = np.arange(1, len(x)   len(L)   1)
    #difference with filter same number of values like length of group
    return np.setdiff1d(a, L)[:len(x)]


df2['Code'] = df2.groupby('Category')['Category'].transform(f)
#created Code with join
df2['Code'] = df2['Category'].map(df11['a'])   '_'   df2['Code'].astype(str).str.zfill(2)
print (df2)
   Id     Data  Category    Code
0   1   tetyer   Airport  AIR_03
1   2    fgdss  Hospital  HEA_01
2   3  dffsdsa   Airport  AIR_05
3   4   dsfsas  Hospital  HEA_02
4   5  fdsfada   Airport  AIR_06

df = pd.concat([df1.drop(['a','b'], 1), df2], ignore_index=True)
print (df)
   Id     Data  Category    Code
0   1      tey   Airport  AIR_02
1   2       fg  Hospital  HEA_04
2   3     dffs   Airport  AIR_01
3   4     dsfs  Hospital  HEA_03
4   5     fdsf   Airport  AIR_04
5   1   tetyer   Airport  AIR_03
6   2    fgdss  Hospital  HEA_01
7   3  dffsdsa   Airport  AIR_05
8   4   dsfsas  Hospital  HEA_02
9   5  fdsfada   Airport  AIR_06

CodePudding user response:

To solve this, I would define a class to act as a code filler. The advantage of this approach is that you can then easily add more data without needing to recompute everything:

class CodeFiller():
    def __init__(self, df, col='Code', maps=None):
        codes = df[col].str.split('_', expand=True).groupby(0)[1].agg(set).to_dict()
        self.maps = maps
        self.gens = {prefix: self.code_gen(prefix, codes[prefix]) for prefix in codes}
        
    def code_gen(self, prefix, codes):
        from itertools import count
        for i in count(1):
            num = f'{i:02}'
            if num not in codes:
                yield f'{prefix}_{num}'
    
    def __call__(self, prefix):
        if self.maps:
            prefix = self.maps[prefix]
        return next(self.gens[prefix])

refs = {'Airport': 'AIR', 'Hospital': 'HEA'}
filler = CodeFiller(df1, maps=refs)

df3 = pd.concat([df1, df2.assign(Code=df2['Category'].map(filler))], ignore_index=True)

output:

    Id     Data  Category    Code
0    1      tey   Airport  AIR_02
1    2       fg  Hospital  HEA_04
2    3     dffs   Airport  AIR_01
3    4     dsfs  Hospital  HEA_03
4    5     fdsf   Airport  AIR_04
5    1   tetyer   Airport  AIR_03
6    2    fgdss  Hospital  HEA_01
7    3  dffsdsa   Airport  AIR_05
8    4   dsfsas  Hospital  HEA_02
9    5  fdsfada   Airport  AIR_06

Now imagine you have more data coming, you can just continue (reusing df2 here for the example):

pd.concat([df3, df2.assign(Code=df2['Category'].map(filler))], ignore_index=True)

output:

    Id     Data  Category    Code
[...]
10   1   tetyer   Airport  AIR_10
11   2    fgdss  Hospital  HEA_07
12   3  dffsdsa   Airport  AIR_11
13   4   dsfsas  Hospital  HEA_08
14   5  fdsfada   Airport  AIR_12
  • Related