I have 12 files:
files = ['01_2021.csv', '02_2021.csv', '03_2021.csv', '04_2021.csv', '05_2021.csv', '06_2021.csv', '07_2021.csv', '08_2021.csv', '09_2021.csv', '10_2021.csv', '11_2020.csv', '12_2020.csv']
My CSV file structure:
id itemName NonImportantEntries Entries SomeOtherEntries
1 item1 27 111 163
2 item2 16 22 98
...
5000
I'm trying to calculate % decrease/increase of latest month's file (in this case 10_2021) value in "Entries" to the previous month's value in "Entries" per each unique id. Also, consider that it's not guaranteed that unique ID will always be present in both files.
10_2021.csv:
id itemName NonImportantEntries Entries SomeOtherEntries
1 item1 27 111 163
2 item2 16 22 98
...
5000
09_2021.csv:
id itemName NonImportantEntries Entries SomeOtherEntries
1 item1 27 97 163
2 item2 16 57 98
...
5000
for example with id=1:
111(10_2021.csv) - 97 (09_2021.csv) = 14
14 / 97 (09_2021.csv) = 0.1443 * 100 = 14.43
for example with id=2:
22(10_2021.csv) - 57 (09_2021.csv) = -35
-35 / 57 (09_2021.csv) = -0.6140 * 100 = -61.40
Desired output is:
id %differenceLatestMonthToPreviousMonth
1 14.43%
2 -61.40%
My Code so far:
import pandas as pd
from os import listdir
from os.path import isfile, join
#readMyDirectoryForFiles
mypath= <myDirectoryPath>
list_of_files = [f for f in listdir(mypath) if isfile(join(mypath, f))]
#GenerateAllFilesInDirList
list_of_files = [mypath x for x in list_of_files]
#sortListToEnsureLatestTwoMonthsAreOnTop
list_of_files.sort()
#ConsiderOnlyLatestTwoMonths
filesNeeded = list_of_files[:2]
#I'm stuck here: map file names to each unique ID and calculate like on examples provided above for id 1 and id 2.
dataframes = [pd.read_csv(fi) for fi in filesNeeded]
Could someone help with this? thank you in advance.
CodePudding user response:
It sounds like you need to pd.DataFrame.join
your two DataFrame
s on your id
column, and then calculate the % difference:
this_month = dataframes[0]
last_month = dataframes[1]
combined = this_month.join(last_month.set_index('id'),
on='id',
lsuffix='_this_month',
rsuffix='_last_month',
)
combined['pct_diff_between_months'] = \
((combined['Entries_this_month']/combined['Entries_last_month'] - 1)*100)