I have a data frame with months (by year), and ID number. I am trying to calculate the attrition rate, but I am getting stuck on obtaining unique ID counts when a month equals a certain month in pandas.
ID. | Month |
---|---|
1 | Sept. 2022 |
2 | Oct. 2022 |
etc... with possible duplicates in ID and 1.75 years worth of data.
import pandas as pd
path = some path on my computer
data = pd.read_excel(path)
if data["Month"] == "Sept. 2022":
ID_SEPT = data["ID."].unique()
return ID_SEPT
I am trying to discover what I am doing incorrect here in this if-then statement. Ideally I am trying to collect all the unique ID values per each month per year to then calculate the attrition rate. Is there something obvious I am doing wrong here?
Thank you.
I tried an id-then statement and I was expecting unique value counts of ID per month.
CodePudding user response:
The way you do this with a dataframe, conceptually, is to filter the entire dataframe to be just the rows where your comparison is true, and then do whatever (get uniques) from there.
That would look like this:
filtered_df = df[df['Month'] == 'Sept. 2022']
ids_sept = list(filtered_df['ID.'].unique()
The first line there can look at a little strange, but what it is doing is:
df['Month'] == 'Sept. 2022'
will return an array/column/series (it actually returns a series) ofTrue/False
whether or not the comparison is, well, true or false.- You then run that series of bools through
df[series_of_bools]
that filters the dataframe to return only the rows where it is True.
Thus, you have a filter.
If you are looking for the number of unique items, rather than the list of unique items, you can also use filtered_df['ID.'].nunique()
and save yourself the step later of getting the length of the list.
CodePudding user response:
You are looking for pandas.groupby.
Use it like this to get the unique values of each Group (Month)
data.groupby("Month")["ID."].unique() # You have a . after ID in your example, check if thats correct
CodePudding user response:
You need to use one of the iterator functions, like items().
for (columnName, columnData) in data.iteritems():
if columnName = 'Month'
[code]
CodePudding user response:
try this
data[data.Month=='Sept. 2022']['ID.'].unique()