Elegant way to join dataframes by name, then autofill blank values with pandas and create a coloured scatter plot

Time:12-05

I just started learning pandas a few days ago. Is there an elegant and efficient way to achieve what the title describes with the following data frames?

import pandas as pd

data1 = pd.DataFrame([['ad_001','50'], ['ad_002', '100'], ['ad_003', '150']],columns=['name', 'score'])
  
data2 = pd.DataFrame([['ad_001','75'], ['ad_002', '200'], ['ad_004', '100']],columns=['name', 'score'])

I tried using

data1.merge(data2, how='left', left_on='name', right_on='name')

to merge the two dataframes. My aim is to join them elegantly in Python and also auto-fill the missing values in both score columns (which I don't know how to do elegantly), like the following:

data1 = pd.DataFrame([['ad_001','50','75'], ['ad_002', '100', '200'], ['ad_003', '150', '0'], ['ad_004', '0', '100']],columns=['name', 'score_x','score_y'])

I can then turn this combined dataframe into a scatter plot of score_x against score_y using

import matplotlib.pyplot as plt

and colour each point according to the maximum score of x and y, i.e. if x or y >100, colour red; if x or y >150, colour green; if x or y >200, colour red. I tried looking at the user guide [https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_with_legend.html#sphx-glr-gallery-lines-bars-and-markers-scatter-with-legend-py] but do not quite know how to implement it in Python. Is it easier to create a new column with pandas to define the colour first, or can matplotlib do that inherently? Or is there another scatter-plotting Python module that you would recommend to achieve the same outcome?

I would appreciate it if any experienced coders could enlighten me on this. Thank you very much in advance. Your generosity in sharing your knowledge and time is truly appreciated.

CodePudding user response:

For the first part, merging the two dataframes, one way is to use merge with how='outer' so that every name from both dataframes is kept. Rows with no data on one side will contain NaN in that score column. Using .fillna(0) handles this, based on how you said you want the missing values to appear.
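As a minimal sketch of just this merge step, the variant below converts the scores to integers before merging rather than after. That ordering is my own assumption about what is cleanest, not something the question requires; the name `merged` is only illustrative.

```python
import pandas as pd

data1 = pd.DataFrame([['ad_001', '50'], ['ad_002', '100'], ['ad_003', '150']],
                     columns=['name', 'score'])
data2 = pd.DataFrame([['ad_001', '75'], ['ad_002', '200'], ['ad_004', '100']],
                     columns=['name', 'score'])

# Convert the string scores to integers up front so that the fill value 0
# has the same type as the rest of the column.
data1['score'] = data1['score'].astype(int)
data2['score'] = data2['score'].astype(int)

# how='outer' keeps every name found in either frame; the clashing 'score'
# columns receive the default suffixes _x and _y.
merged = data1.merge(data2, on='name', how='outer')

# Names missing from one side come back as NaN, so fill those with 0
# and cast both score columns back to int.
merged = merged.fillna({'score_x': 0, 'score_y': 0})
merged = merged.astype({'score_x': int, 'score_y': int})

print(merged)
```

With how='left', as in the question, ad_004 would be dropped because it only appears in data2.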

For the conditions and plotting, the simplest way is to use something like np.where(), which lets you assign the colors you want. As your question had red for two conditions, I have made one red and the other blue. You can adjust the numbers and colors as you need. Once the column with colors is available, using groupby() and plotting each group will give you the result you need. Hope this helps...

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data1 = pd.DataFrame([['ad_001','50'], ['ad_002', '100'], ['ad_003', '150']],columns=['name', 'score'])
data2 = pd.DataFrame([['ad_001','75'], ['ad_002', '200'], ['ad_004', '100']],columns=['name', 'score'])

newdata = pd.merge(data1, data2, on="name", how='outer').fillna(0)  ## Merge & fillna()
newdata['score_x'] = newdata['score_x'].astype('int64')  ## Convert to int as you are comparing
newdata['score_y'] = newdata['score_y'].astype('int64')  ## Convert to int as you are comparing

## Use np.where to create color column with the colors you need
newdata['color'] = np.where(((newdata.score_x < 100) & (newdata.score_y < 100)), 'red',
                       np.where(((newdata.score_x < 150) & (newdata.score_y < 150)), 'green', 'blue'))

## Group by color and plot each group in its own color, score_x against score_y
fig, ax = plt.subplots()
for clr, d in newdata.groupby('color'):
    ax.scatter(x=d['score_x'], y=d['score_y'], color=clr, label=clr)
ax.legend()
plt.show()
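If the colour should follow the maximum of the two scores, as the question describes, one possible alternative to nested np.where() calls is np.select() on the row-wise maximum. This is only a sketch: the thresholds 100, 150 and 200 come from the question, but the colour names are placeholders of mine (the question lists red twice), and `top` is an illustrative helper name.

```python
import numpy as np
import pandas as pd

# Merged frame matching the desired output in the question, with integer scores.
newdata = pd.DataFrame({
    'name': ['ad_001', 'ad_002', 'ad_003', 'ad_004'],
    'score_x': [50, 100, 150, 0],
    'score_y': [75, 200, 0, 100],
})

# Larger of the two scores for each row.
top = newdata[['score_x', 'score_y']].max(axis=1)

# np.select uses the first condition that is True for each row, so the
# thresholds are listed from highest to lowest. Colour names are placeholders.
conditions = [top > 200, top > 150, top > 100]
colours = ['red', 'green', 'orange']
newdata['color'] = np.select(conditions, colours, default='blue')

print(newdata)
```

The resulting column can also be passed to matplotlib in a single call, e.g. ax.scatter(newdata['score_x'], newdata['score_y'], c=newdata['color']), since the c argument accepts a sequence of colour names. The groupby() loop is still useful when a legend entry per colour is wanted.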
