I have an excel like this, in the 'Elements' sheet of the excel like below: https://i.stack.imgur.com/pT0PY.png
Item Category Value
TRANSPORT A 1
Bus A 2
Car A 3
Automobile A 4
Bike A 5
ACCOMODATION A 6
House A 7
Apartment A 8
DELIVERY B 9
Glovo B 10
Emag B 11
Transporter B 12
ACCOMODATION B 13
Apartment1 B 14
Apartment2 B 15
ACCOMODATION C 16
Rental C 17
Apartment C 18
I want to separate the items(TRANSPORT, ACCOMODATION, DELIVERY) from the elements(Bus, Car, Automobile, Bike...) into a dataframe like below:
Element Item Category Value
0 Bus TRANSPORT A 2
1 Car TRANSPORT A 3
2 Automobile TRANSPORT A 4
3 Bike TRANSPORT A 5
4 House ACCOMODATION A 7
5 Apartment ACCOMODATION A 8
I have managed to write the code to separate the elements from the category A, but somehow the code breaks when I use it for the category B or C or others. It triggers:IndexError: index 8 is out of bounds for axis 0 with size 7
because of the trimming of indexes in the code. I need to extract the values for the Value column only for the elements, not for the items and it breaks because of the length mismatch.
I would need the final dataframe to contain all the information for all the categories from the Excel, not just for one category.
What I tried so far (working for category A only):
import pandas as pd
import numpy as np
df = pd.read_excel('elements.xlsx',
['Elements'], engine='openpyxl')
category_names = df['Elements']['Category'].unique()
df['Elements'] = df['Elements'].groupby(['Category'])
categ_group = ['TRANSPORT', 'ACCOMODATION', 'DELIVERY']
def create_category_df(category_name='A'):
helper_df = df['Elements'].get_group(category_name)
# get index for items
item_index = helper_df[helper_df["Item"].isin(categ_group)].index.to_list()
# get elements and associated items
item_data = np.split(helper_df['Item'].to_numpy(), item_index)
helper_df = helper_df.drop(helper_df.index[item_index]) # drop rows for items
helper_df = helper_df.reset_index(drop=True)
resulted_df = pd.DataFrame(columns=['Element', 'Item', 'Category', 'Value'])
item_list = []
for index in range(len(item_data)):
if item_data[index].size != 0:
resulted_df = resulted_df.append(pd.DataFrame(item_data[index][1:], columns=['Element']))
item_list = len(item_data[index][1:]) * [
item_data[index][0]] # multiply items by number of times it is present and add it to df
resulted_df['Category'] = category_name # 'Hardware EA'
resulted_df['Item'] = item_list
resulted_df['Value'] = helper_df['Value'].values
resulted_df = resulted_df.reset_index(drop=True)
print(resulted_df.to_string())
return resulted_df
create_category_df()
CodePudding user response:
First replace columns names, then repalce non matched values of list by NaN
s in Series.where
, so possible forward filling missing values in DataFrame.insert
for second new column and last remove rows if equal values in both columns in boolean indexing
:
categ_group = ['TRANSPORT', 'ACCOMODATION', 'DELIVERY']
df = df.rename(columns={'Item':'Element'})
df.insert(1, 'Item', df['Element'].where(df['Element'].isin(categ_group)).ffill())
df =df[ df['Element'].ne(df['Item'])]
print (df)
Element Item Category Value
1 Bus TRANSPORT A 2
2 Car TRANSPORT A 3
3 Automobile TRANSPORT A 4
4 Bike TRANSPORT A 5
6 House ACCOMODATION A 7
7 Apartment ACCOMODATION A 8
9 Glovo DELIVERY B 10
10 Emag DELIVERY B 11
11 Transporter DELIVERY B 12
13 Apartment1 ACCOMODATION B 14
14 Apartment2 ACCOMODATION B 15
16 Rental ACCOMODATION C 17
17 Apartment ACCOMODATION C 18