Home > Enterprise >  How to parse excel data into dataframe with particular format?
How to parse excel data into dataframe with particular format?

Time:10-15

I have an excel like this, in the 'Elements' sheet of the excel like below: https://i.stack.imgur.com/pT0PY.png

Item              Category  Value
TRANSPORT         A         1
Bus               A         2
Car               A         3
Automobile        A         4
Bike              A         5
ACCOMODATION      A         6
House             A         7
Apartment         A         8
DELIVERY          B         9
Glovo             B         10
Emag              B         11
Transporter       B         12
ACCOMODATION      B         13
Apartment1        B         14
Apartment2        B         15
ACCOMODATION      C         16
Rental            C         17
Apartment         C         18

I want to separate the items(TRANSPORT, ACCOMODATION, DELIVERY) from the elements(Bus, Car, Automobile, Bike...) into a dataframe like below:

      Element          Item Category  Value
0         Bus     TRANSPORT        A      2
1         Car     TRANSPORT        A      3
2  Automobile     TRANSPORT        A      4
3        Bike     TRANSPORT        A      5
4       House  ACCOMODATION        A      7
5   Apartment  ACCOMODATION        A      8

I have managed to write the code to separate the elements from the category A, but somehow the code breaks when I use it for the category B or C or others. It triggers:IndexError: index 8 is out of bounds for axis 0 with size 7 because of the trimming of indexes in the code. I need to extract the values for the Value column only for the elements, not for the items and it breaks because of the length mismatch. I would need the final dataframe to contain all the information for all the categories from the Excel, not just for one category. What I tried so far (working for category A only):

import pandas as pd
import numpy as np

df = pd.read_excel('elements.xlsx',
                   ['Elements'], engine='openpyxl')

category_names = df['Elements']['Category'].unique()

df['Elements'] = df['Elements'].groupby(['Category'])

categ_group = ['TRANSPORT', 'ACCOMODATION', 'DELIVERY']


def create_category_df(category_name='A'):
    helper_df = df['Elements'].get_group(category_name)

    # get index for items
    item_index = helper_df[helper_df["Item"].isin(categ_group)].index.to_list()

    # get elements and associated items
    item_data = np.split(helper_df['Item'].to_numpy(), item_index)

    helper_df = helper_df.drop(helper_df.index[item_index])  # drop rows for items

    helper_df = helper_df.reset_index(drop=True)

    resulted_df = pd.DataFrame(columns=['Element', 'Item', 'Category', 'Value'])

    item_list = []
    for index in range(len(item_data)):
        if item_data[index].size != 0:
            resulted_df = resulted_df.append(pd.DataFrame(item_data[index][1:], columns=['Element']))
            item_list  = len(item_data[index][1:]) * [
                item_data[index][0]]  # multiply items by number of times it is present and add it to df

    resulted_df['Category'] = category_name  # 'Hardware EA'
    resulted_df['Item'] = item_list
    resulted_df['Value'] = helper_df['Value'].values
    resulted_df = resulted_df.reset_index(drop=True)
    print(resulted_df.to_string())
    return resulted_df


create_category_df()

CodePudding user response:

First replace columns names, then repalce non matched values of list by NaNs in Series.where, so possible forward filling missing values in DataFrame.insert for second new column and last remove rows if equal values in both columns in boolean indexing:

categ_group = ['TRANSPORT', 'ACCOMODATION', 'DELIVERY']

df = df.rename(columns={'Item':'Element'})
df.insert(1, 'Item', df['Element'].where(df['Element'].isin(categ_group)).ffill())
df =df[ df['Element'].ne(df['Item'])]

print (df)
        Element          Item Category  Value
1           Bus     TRANSPORT        A      2
2           Car     TRANSPORT        A      3
3    Automobile     TRANSPORT        A      4
4          Bike     TRANSPORT        A      5
6         House  ACCOMODATION        A      7
7     Apartment  ACCOMODATION        A      8
9         Glovo      DELIVERY        B     10
10         Emag      DELIVERY        B     11
11  Transporter      DELIVERY        B     12
13   Apartment1  ACCOMODATION        B     14
14   Apartment2  ACCOMODATION        B     15
16       Rental  ACCOMODATION        C     17
17    Apartment  ACCOMODATION        C     18
  • Related