what's up?
I am having a little problem, where I need to use the pandas dropna function to remove rows from my dataframe. However, I need it to not delete the unique values from my dataframe.
Let me explain better. I have the following dataframe:
id | birthday |
---|---|
0102-2 | 09/03/2020 |
0103-2 | 14/03/2020 |
0104-2 | NaN |
0105-2 | NaN |
0105-2 | 25/03/2020 |
0108-2 | 07/04/2020 |
In the case above, I need to delete the row from my dataframe based on the NaN values in the birthday column. However, as you can see the id "0104-2" is unique unlike the id "0105-2" where it has a NaN value and another with a date. So I would like to keep track of all the lines that have NaN that are unique.
Is it feasible to do this with dropna, or would I have to pre-process the information beforehand?
CodePudding user response:
You could sort by the birthday column and then drop duplicates keeping the first out of the two, by doing the following:
The complete code would look like this:
import pandas as pd
import numpy as np
data = {
"id": ['102-2','103-2','104-2', '105-2', '105-2', '108-2'],
"birthday":['09/03/2020', '14/03/2020', np.nan, np.nan, '25/03/2020', '07/04/2020']
}
df = pd.DataFrame(data)
df.sort_values(['birthday'], inplace=True)
df.drop_duplicates(subset="id", keep='first', inplace=True)
df.sort_values(['id'], inplace=True)
CODE EXPLANATION: Here is the original dataframe:
import pandas as pd
import numpy as np
data = {
"id": ['102-2','103-2','104-2', '105-2', '105-2', '108-2'],
"birthday":['09/03/2020', '14/03/2020', np.nan, np.nan, '25/03/2020', '07/04/2020']
}
df = pd.DataFrame(data)
Now sort the dataframe:
df.sort_values(['birthday'], inplace=True)
Then drop the duplicates based on the id
column. Keeping only the first value.
df.drop_duplicates(subset="id", keep='first', inplace=True)
CodePudding user response:
Sorry I read your requirement wrongly.. Should sort first then using shift()
df.sort_values(by='id', na_position='first', inplace=True)
df.loc[(df['birthday'].notna()) | (df['id'].shift(-1) != df['id'])]
Original Answer:
You can filter using notna() or duplicated()
df.loc[(df['birthday'].notna()) | (df.duplicated(subset='id', keep=False))]
CodePudding user response:
I could do this by grouping by id
and aggregating with .agg('first')
:
# added a couple more rows to show the behaviour
df = pd.DataFrame({'id': ['0102-2', '0102-2', '0103-2', '0103-2', '0104-2', '0105-2', '0105-2', '0108-2'],
'birthday': ['09/03/2020', np.nan, '14/03/2020', '12/03/2020', np.nan, np.nan, '25/03/2020', '07/04/2020']})
df
# Output:
id birthday
0 0102-2 09/03/2020
1 0102-2 NaN # duplicate id with NaN goes second
2 0103-2 14/03/2020
3 0103-2 12/03/2020 # duplicate id with an earlier date
4 0104-2 NaN # unique with NaN
5 0105-2 NaN # duplicate with NaN goes first
6 0105-2 25/03/2020
7 0108-2 07/04/2020
df.groupby('id').agg('first').reset_index()
# Output:
id birthday
0 0102-2 09/03/2020
1 0103-2 14/03/2020
2 0104-2 None
3 0105-2 25/03/2020
4 0108-2 07/04/2020
CodePudding user response:
I hope this is what you wanted:
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 3, 4], 'birthday': [1994, np.nan, np.nan, 2019, 2020]})
df["id_count"] = df.groupby('id')['id'].transform('count') # helper column to identify unique ids
df_filtered = df[(df["id_count"] > 1) & df["birthday"].isna() == False] # filter out nans within non-unique ids
df_filtered.drop(["id_count"], axis=1) # drop helper column
output
id birthday
0 1 1994.0
1 2 NaN
3 3 2019.0
4 4 2020.0