I have a pandas dataframe that looks like this:
Index | Year | Month | Fruit | Count |
---|---|---|---|---|
0 | 2021 | 03 | Apple | 2 |
1 | 2021 | 03 | Orange | 3 |
2 | 2021 | 03 | Mango | 4 |
3 | 2021 | 04 | Apple | 1 |
4 | 2021 | 04 | Mango | 2 |
5 | 2021 | 05 | Apple | 1 |
Let's say that Apple, Orange and Mango are the only fruits I am interested in.
The dataframe is formed such that if the count for a fruit (in a month and year) is 0, it won't show up as a record. I want to add the missing fruit row to the dataframe with count column value = 0. In short, I want each of the 3 fruits to show up for the year and month irrespective of the count being 0. So I want it to look like this:
Index | Year | Month | Fruit | Count |
---|---|---|---|---|
0 | 2021 | 03 | Apple | 2 |
1 | 2021 | 03 | Orange | 3 |
2 | 2021 | 03 | Mango | 4 |
3 | 2021 | 04 | Apple | 1 |
x | 2021 | 04 | Orange | 0 |
4 | 2021 | 04 | Mango | 2 |
5 | 2021 | 05 | Apple | 1 |
x | 2021 | 05 | Orange | 0 |
x | 2021 | 05 | Mango | 0 |
Appreciate any suggestions on how to approach this.
CodePudding user response:
You can use pyjanitor
's complete
to fill the missing rows:
import io
import pandas as pd
import janitor
data = '''Index Year Month Fruit Count
0 2021 03 Apple 2
1 2021 03 Orange 3
2 2021 03 Mango 4
3 2021 04 Apple 1
4 2021 04 Mango 2
5 2021 05 Apple 1'''
df = pd.read_csv(io.StringIO(data), sep='\t', index_col='Index')
df = df.complete('Fruit', ['Year', 'Month']).fillna(0, downcast='infer').sort_values(['Year', 'Month', 'Fruit'])
Output:
Year | Month | Fruit | Count | |
---|---|---|---|---|
0 | 2021 | 3 | Apple | 2 |
2 | 2021 | 3 | Mango | 4 |
1 | 2021 | 3 | Orange | 3 |
3 | 2021 | 4 | Apple | 1 |
4 | 2021 | 4 | Mango | 2 |
6 | 2021 | 4 | Orange | 0 |
5 | 2021 | 5 | Apple | 1 |
8 | 2021 | 5 | Mango | 0 |
7 | 2021 | 5 | Orange | 0 |
CodePudding user response:
You can use Categorical
to add Fruit
as a category
Year = [2021, 2021, 2021, 2021, 2021, 2021]
Month = ["03","03", "03","04","04","05"]
Fruit = ["Apple", "Orange", "Mango", "Apple", "Mango", "Apple"]
Count = [2,3,4,1,2,1]
df = pd.DataFrame({"Year":Year, "Month":Month, "Fruit":Fruit, "Count":Count})
df['Fruit'] = pd.Categorical(df['Fruit'], categories=df['Fruit'].unique())
df = df.groupby(['Year','Month','Fruit'], as_index=False).first()
df['Count'] = df.Count.fillna(0).astype(int)
df
Output:
Year Month Fruit Count
0 2021 03 Apple 2
1 2021 03 Orange 3
2 2021 03 Mango 4
3 2021 04 Apple 1
4 2021 04 Orange 0
5 2021 04 Mango 2
6 2021 05 Apple 1
7 2021 05 Orange 0
8 2021 05 Mango 0