I have a dataset of prices for various types of fruits in various quarters for several years. I want to filter a column based on the values of multiple other columns. The dataframe looks like this:
First column contains years. Column 2 can have a mix of 4 quarters Q1, Q2, Q3, and Q4. Column 3 contains the names of the foods. Column 4 contains the type/variety of the foods from column 3. Finally, column 5 contains the price per unit.
I want to consider only those foods that have price data for all 4 quarters every year from 2015-2017. So, I will only take Apples, Oranges and Melons and I ignore Guava. Guava does not have price data for all quarters and all years from 2015-2017.
My desired dataframe will have 13 columns as below. First column will list the foods satisfying the above criteria. Rest of the columns will indicate the prices in the specific quarter and year. There is one subtlety.
For a given food, if there are repeated quarters and repeated years, simultaneosly, then that means that there are more than 1 type of food. E.g., For Apples in column 3, there are repeated Q3 and repeated 2017, simultaneously. This is because there are two types of apple - apple1 and apple2 - in the data for that quarter in that year. In such cases, I want to take their arithmetic average (sum/freq) of prices in the dataframe I need. So, for price for Apples in Q3_2017 would be (112.1 112.5)/2=112.3.
My desired dataframe is
Foods Q1_2015 Q2_2015 Q3_2015 Q4_2015 Q1_2016 Q2_2016 Q3_2016 Q4_2016 Q1_2017 Q2_2017 Q3_2017 Q4_2017
Apples
Oranges
Melons
Because of the need to manipulate several columns, I am not able to use groupby method. How can I get the dataframe as above? Any help is much appreciated.
CodePudding user response:
s ="""Year Quarter Foods Types Prices
2015 Q1 Apples Apple1 107.7
2015 Q2 Apples Apple1 109.4
2015 Q3 Apples Apple1 110.2
2015 Q4 Apples Apple1 109.6
2016 Q1 Apples Apple1 109.3
2016 Q2 Apples Apple1 108.7
2016 Q3 Apples Apple1 112.6
2016 Q4 Apples Apple1 110.4
2017 Q1 Apples Apple1 111.8
2017 Q2 Apples Apple1 110.7
2017 Q3 Apples Apple1 112.1
2017 Q3 Apples Apple2 112.5
2017 Q4 Apples Apple1 112.0
2015 Q1 Oranges Orange1 89.9
2015 Q2 Oranges Orange1 90.5
2015 Q3 Oranges Orange1 89.6
2015 Q4 Oranges Orange1 89.8
2016 Q1 Oranges Orange1 88.8
2016 Q2 Oranges Orange1 89.9
2016 Q3 Oranges Orange1 90.5
2016 Q4 Oranges Orange1 89.9
2017 Q1 Oranges Orange1 90.0
2017 Q2 Oranges Orange1 89.1
2017 Q3 Oranges Orange1 89.2
2017 Q4 Oranges Orange1 89.5
2015 Q1 Melons Melons 1 94.7
2015 Q2 Melons Melons 1 94.6
2015 Q3 Melons Melons 1 96.4
2015 Q4 Melons Melons 1 95.3
2016 Q1 Melons Melons 1 94.1
2016 Q2 Melons Melons 1 93.0
2016 Q2 Melons Melons 2 105.2
2016 Q3 Melons Melons 1 97.2
2016 Q4 Melons Melons 1 94.9
2017 Q1 Melons Melons 1 97.1
2017 Q2 Melons Melons 1 96.5
2017 Q3 Melons Melons 1 95.7
2017 Q4 Melons Melons 1 96.9
2015 Q1 Guava Guava1 100.2"""
df = pd.DataFrame([x.split('\t') for x in s.split('\n')])
df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
df["Prices"] = pd.to_numeric(df["Prices"]) # ensuring prices are numeric
I am using the years as a string here, as there are no DateTime-specific operations for the question. You can adapt your code according to your requirement.
fy= ['2015', '2016', '2017'] # list of years interested in
ar = []
for name, group in df.groupby('Foods'):
if list(group['Year'].unique()) ==fy: # compare years with your year list
# aggreagrate multiple price enteries
temp = group.groupby(['Year','Foods','Quarter'], as_index=False)['Prices'].mean()
ar.append(temp)
df_temp = pd.concat(ar).reset_index(drop=True) # create a new dataframe
# restructure the dataframe with food as the index, and year and quater as columns
df_temp = df_temp.pivot(index='Foods',columns=['Year','Quarter'],values='Prices')
# format the column names
df_temp.columns = [f'{y}_{x}' for x,y in df_temp.columns] # x is year and y is quater
df_temp = df_temp.reset_index()
df_temp
Foods Q1_2015 Q2_2015 Q3_2015 Q4_2015 Q1_2016 Q2_2016 Q3_2016 Q4_2016 Q1_2017 Q2_2017 Q3_2017 Q4_2017
0 Apples 107.7 109.4 110.2 109.6 109.3 108.7 112.6 110.4 111.8 110.7 112.3 112.0
1 Melons 94.7 94.6 96.4 95.3 94.1 99.1 97.2 94.9 97.1 96.5 95.7 96.9
2 Oranges 89.9 90.5 89.6 89.8 88.8 89.9 90.5 89.9 90.0 89.1 89.2 89.5
CodePudding user response:
Use GroupBy.transform
with DataFrameGroupBy.nunique
per Year and Foods for only groups with 4 unique values in Quarter
and for Year
s exist in list by Series.isin
, then filter in boolean indexing
Last pivoting by DataFrame.pivot_table
with flatten MultiIndex
:
years = [2015,2016,2017]
mask = (df.groupby(['Foods','Year'])['Quarter'].transform('nunique').eq(4) &
df['Year'].isin(years))
df = df[mask]
#pivoting and aggregate mean
df1=df.pivot_table(index='Foods',columns=['Year','Quarter'],values='Prices',aggfunc='mean')
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df1)
2015_Q1 2015_Q2 2015_Q3 2015_Q4 2016_Q1 2016_Q2 2016_Q3 \
Foods
Apples 107.7 109.4 110.2 109.6 109.3 108.7 112.6
Melons 94.7 94.6 96.4 95.3 94.1 99.1 97.2
Oranges 89.9 90.5 89.6 89.8 88.8 89.9 90.5
2016_Q4 2017_Q1 2017_Q2 2017_Q3 2017_Q4
Foods
Apples 110.4 111.8 110.7 112.3 112.0
Melons 94.9 97.1 96.5 95.7 96.9
Oranges 89.9 90.0 89.1 89.2 89.5