Keep entry with least missing values for a given observation in dataframe-CodePudding

I have a dataframe that includes US company identifiers (Instrument) as well as the company's Name, ISIN and it's CIK number.

Here is an example of my dataset:

dict = { "Instrument": ["4295914485", "4295913199", "4295904693", "5039191995", "5039191995"],
        "Company Name":["Orrstown Financial Services Inc", "Ditech Networks Inc", "Penn Treaty American Corp", "Verb Technology Company Inc", np.nan],
        "CIK" : ["826154", "1080667", "814181", "1566610", "1622355"],
        "ISIN" : ["US6873801053", "US25500T1088", "US7078744007", "US92337U1043", np.nan]
        }
df = pd.DataFrame(data=dict)
df

In some cases, there is more than one entry for each Instrument as can bee seen at the Instrument 5039191995. In those cases however, most of the time, there is one entry that is "superior" to the other ones in terms of information content.

For example, in the first of the two entries for Instrument 5039191995 there is no information missing, while in the second entry the Company name as well as the ISIN are missing. In this case I only would like to keep the first entry and drop the second one.

Overall Goal: For each entry that has duplicates in terms of the Instrument column, I only want to keep the entry that has the least missing values. If there are duplicates that have the same amount of missing values, all of those should be kept.

CodePudding user response：

You could use the number of null values in a row as a sort key, and keep the first (lowest) of each Instrument

import pandas as pd
import numpy as np
dict = { "Instrument": ["4295914485", "4295913199", "4295904693", "5039191995", "5039191995"],
        "Company Name":["Orrstown Financial Services Inc", "Ditech Networks Inc", "Penn Treaty American Corp", "Verb Technology Company Inc", np.nan],
        "CIK" : ["826154", "1080667", "814181", "1566610", "1622355"],
        "ISIN" : ["US6873801053", "US25500T1088", "US7078744007", "US92337U1043", np.nan]
        }
df = pd.DataFrame(data=dict)

df.assign(missing=df.isnull().sum(1)).sort_values(by='missing', ascending=True).drop_duplicates(subset='Instrument', keep='first').drop(columns='missing')

Output

   Instrument                     Company Name      CIK          ISIN
0  4295914485  Orrstown Financial Services Inc   826154  US6873801053
1  4295913199              Ditech Networks Inc  1080667  US25500T1088
2  4295904693        Penn Treaty American Corp   814181  US7078744007
3  5039191995      Verb Technology Company Inc  1566610  US92337U1043