Home > Enterprise >  Calculate % increase/decrease for a column value in two separate data frames per each unique ID
Calculate % increase/decrease for a column value in two separate data frames per each unique ID

Time:11-13

I have 12 files:

files = ['01_2021.csv', '02_2021.csv', '03_2021.csv', '04_2021.csv', '05_2021.csv', '06_2021.csv', '07_2021.csv', '08_2021.csv', '09_2021.csv', '10_2021.csv', '11_2020.csv', '12_2020.csv']

My CSV file structure:

id    itemName    NonImportantEntries    Entries    SomeOtherEntries
1      item1              27              111             163
2      item2              16               22              98
...
5000

I'm trying to calculate % decrease/increase of latest month's file (in this case 10_2021) value in "Entries" to the previous month's value in "Entries" per each unique id. Also, consider that it's not guaranteed that unique ID will always be present in both files.

10_2021.csv:

id    itemName    NonImportantEntries    Entries    SomeOtherEntries
1      item1              27              111             163
2      item2              16               22              98
...
5000

09_2021.csv:

id    itemName    NonImportantEntries    Entries    SomeOtherEntries
1      item1              27               97             163
2      item2              16               57              98
...
5000

for example with id=1:

111(10_2021.csv) - 97 (09_2021.csv) = 14
14 / 97 (09_2021.csv) = 0.1443 * 100 = 14.43

for example with id=2:

22(10_2021.csv) - 57 (09_2021.csv) = -35
-35 / 57 (09_2021.csv) = -0.6140 * 100 = -61.40

Desired output is:

id    %differenceLatestMonthToPreviousMonth
1                    14.43%
2                   -61.40%

My Code so far:

import pandas as pd

from os import listdir
from os.path import isfile, join

#readMyDirectoryForFiles
mypath= <myDirectoryPath>
   
list_of_files = [f for f in listdir(mypath) if isfile(join(mypath, f))]

#GenerateAllFilesInDirList
list_of_files = [mypath x for x in list_of_files]

#sortListToEnsureLatestTwoMonthsAreOnTop
list_of_files.sort()

#ConsiderOnlyLatestTwoMonths
filesNeeded = list_of_files[:2]

#I'm stuck here: map file names to each unique ID and calculate like on examples provided above for id 1 and id 2.

dataframes = [pd.read_csv(fi) for fi in filesNeeded]

Could someone help with this? thank you in advance.

CodePudding user response:

It sounds like you need to pd.DataFrame.join your two DataFrames on your id column, and then calculate the % difference:

this_month = dataframes[0]
last_month = dataframes[1]

combined = this_month.join(last_month.set_index('id'),
                           on='id',
                           lsuffix='_this_month',
                           rsuffix='_last_month',
                          )

combined['pct_diff_between_months'] = \
    ((combined['Entries_this_month']/combined['Entries_last_month'] - 1)*100)
  • Related