Home > Software engineering >  pandas link column using list item
pandas link column using list item

Time:05-06

I have two dataframes df and tf like as given below

df = [{"unique_key": 1, "test_ids": "1.0,15,2.0,nan"}, {"unique_key": 2, "test_ids": "51,75.0,11.0,NaN"},{"unique_key": 3, "test_ids":np.nan},
     {"unique_key": 4, "test_ids":np.nan}]
df = pd.DataFrame(df)

test_ids,status,revenue,cnt_days     
1,passed,234.54,3          
2,passed,543.21,5
11,failed,21.3,4
15,failed,2098.21,6             
51,passed,232,21     
75,failed,123.87,32 

tf = pd.read_clipboard(sep=',')

I would like to link the unique_key column from df to the tf dataframe

For ex: I will show my output below (that's easy to understand than text)

I was trying something like below

for b in df.test_ids.tolist():
    for a in b.split(','):
        if a >= 0: # to exclude NA values from checking
            for i in len(test_ids):
              if int(a)  == tf['test_ids'][i]:
                   tf['unique_key'] = df['unique_key']
                 

But this is neither efficient nor elegant to solve my problem.

Is there any other better way to achieve the expected output shown below?

enter image description here

CodePudding user response:

You can create Series with remove duplicates and missing values, swap to dictioanry and for new first column use DataFrame.insert with Series.map:

s = (df.set_index('unique_key')['test_ids']
       .str.split(',')
       .explode()
       .astype(float)
       .dropna()
       .astype(int)
       .drop_duplicates()
d = {v: k for k, v in s.items()}
print (d)
{1: 1, 15: 1, 2: 1, 51: 2, 75: 2, 11: 2}

tf.insert(0, 'unique_key', tf['test_ids'].map(d))
print (tf)
   unique_key  test_ids  status  revenue  cnt_days
0           1         1  passed   234.54         3
1           1         2  passed   543.21         5
2           2        11  failed    21.30         4
3           1        15  failed  2098.21         6
4           2        51  passed   232.00        21
5           2        75  failed   123.87        32

Another idea is working with DataFrame and create Series for mapping:

s = (df.assign(new = df['test_ids'].str.split(','))
       .explode('new')
       .astype({'new':float})
       .dropna(subset=['new'])
       .astype({'new':int})
       .drop_duplicates(subset=['new'])
       .set_index('new')['unique_key'])

print (s)
new
1     1
15    1
2     1
51    2
75    2
11    2
Name: unique_key, dtype: int64

tf.insert(0, 'unique_key', tf['test_ids'].map(s))
print (tf)
   unique_key  test_ids  status  revenue  cnt_days
0           1         1  passed   234.54         3
1           1         2  passed   543.21         5
2           2        11  failed    21.30         4
3           1        15  failed  2098.21         6
4           2        51  passed   232.00        21
5           2        75  failed   123.87        32
  • Related