I have a dataframe that includes US company identifiers (Instrument) as well as the company's Name, ISIN and it's CIK number.
Here is an example of my dataset:
dict = { "Instrument": ["4295914485", "4295913199", "4295904693", "5039191995", "5039191995"],
"Company Name":["Orrstown Financial Services Inc", "Ditech Networks Inc", "Penn Treaty American Corp", "Verb Technology Company Inc", np.nan],
"CIK" : ["826154", "1080667", "814181", "1566610", "1622355"],
"ISIN" : ["US6873801053", "US25500T1088", "US7078744007", "US92337U1043", np.nan]
}
df = pd.DataFrame(data=dict)
df
In some cases, there is more than one entry for each Instrument as can bee seen at the Instrument 5039191995. In those cases however, most of the time, there is one entry that is "superior" to the other ones in terms of information content.
For example, in the first of the two entries for Instrument 5039191995 there is no information missing, while in the second entry the Company name as well as the ISIN are missing. In this case I only would like to keep the first entry and drop the second one.
Overall Goal: For each entry that has duplicates in terms of the Instrument column, I only want to keep the entry that has the least missing values. If there are duplicates that have the same amount of missing values, all of those should be kept.
CodePudding user response:
You could use the number of null values in a row as a sort key, and keep the first (lowest) of each Instrument
import pandas as pd
import numpy as np
dict = { "Instrument": ["4295914485", "4295913199", "4295904693", "5039191995", "5039191995"],
"Company Name":["Orrstown Financial Services Inc", "Ditech Networks Inc", "Penn Treaty American Corp", "Verb Technology Company Inc", np.nan],
"CIK" : ["826154", "1080667", "814181", "1566610", "1622355"],
"ISIN" : ["US6873801053", "US25500T1088", "US7078744007", "US92337U1043", np.nan]
}
df = pd.DataFrame(data=dict)
df.assign(missing=df.isnull().sum(1)).sort_values(by='missing', ascending=True).drop_duplicates(subset='Instrument', keep='first').drop(columns='missing')
Output
Instrument Company Name CIK ISIN
0 4295914485 Orrstown Financial Services Inc 826154 US6873801053
1 4295913199 Ditech Networks Inc 1080667 US25500T1088
2 4295904693 Penn Treaty American Corp 814181 US7078744007
3 5039191995 Verb Technology Company Inc 1566610 US92337U1043