I'm trying to merge multiple year end results (2013 until 2021) which I read with pandas from excel - see those three years as an example, produced with the following code:
HNRrawfile_2018 = HNR_source_path 'HNR_AR2018_XLS_en.xlsx'
df2018_assets = pd.read_excel(
HNRrawfile_2018,
sheet_name=assets,
skiprows=4,
skipfooter=1,
usecols=[0, 2])
df2018_assets.columns = ['ITEM', '2018']
df2018_assets['2018'] = df2018_assets['2018'].astype(int)
df2018_assets['ITEM'] = df2018_assets['ITEM'].str.strip().str.lower()
print(df2018_assets)
ITEM 2019
0 fixed-income securities - held to maturity 223049
1 fixed-income securities - loans and receivables 2194064
2 fixed-income securities - available for sale 38068459
3 fixed-income securities - at fair value throug... 578779
4 equity securities - available for sale 29215
5 other financial assets - at fair value through... 235019
6 investment property 1749517
7 real estate funds 534739
8 investments in associated companies 245478
9 other invested assets 2211905
10 short-term investments 468350
11 cash and cash equivalents 1090852
12 total investments and cash under own management 47629426
13 funds withheld 10948469
14 contract deposits 325302
15 total investments 58903197
16 reinsurance recoverables on unpaid claims 2050114
17 reinsurance recoverables on benefit reserve 852598
18 prepaid reinsurance premium 116176
19 reinsurance recoverables on other technical re... 9355
20 deferred acquisition costs 2931722
21 accounts receivable 5269792
22 goodwill 88303
23 deferred tax assets 442469
24 other assets 640956
25 accrued interest and rent 15414
26 assets held for sale 36308
27 total assets 71356404
ITEM 2018
0 fixed-income securities – held to maturity 249943
1 fixed-income securities – loans and receivables 2398950
2 fixed-income securities – available for sale 33239685
3 fixed-income securities – at fair value throug... 559750
4 equity securities – available for sale 28729
5 other financial assets – at fair value through... 190759
6 investment property 1684932
7 real estate funds 433899
8 investments in associated companies 110545
9 other invested assets 1805281
10 short-term investments 421950
11 cash and cash equivalents 1072915
12 total investments and cash under own management 42197338
13 funds withheld 10691768
14 contract deposits 172873
15 total investments 53061979
16 reinsurance recoverables on unpaid claims 2084630
17 reinsurance recoverables on benefit reserve 909056
18 prepaid reinsurance premium 93678
19 reinsurance recoverables on other technical re... 7170
20 deferred acquisition costs 2155820
21 accounts receivable 3975778
22 goodwill 85588
23 deferred tax assets 454608
24 other assets 629420
25 accrued interest and rent 11726
26 assets held for sale 1039184
27 total assets 64508637
ITEM 2017
0 fixed-income securities – held to maturity 336182
1 fixed-income securities – loans and receivables 2455164
2 fixed-income securities – available for sale 31281908
3 fixed-income securities – at fair value throug... 212042
4 equity securities – available for sale 37520
5 other financial assets – at fair value through... 88832
6 real estate and real estate funds 1968702
7 investments in associated companies 121075
8 other invested assets 1761678
9 short-term investments 958669
10 cash and cash equivalents 835706
11 total investments and cash under own management 40057478
12 funds withheld 10735012
13 contract deposits 167854
14 total investments 50960344
15 reinsurance recoverables on unpaid claims 1651335
16 reinsurance recoverables on benefit reserve 959533
17 prepaid reinsurance premium 96402
18 reinsurance recoverables on other technical re... 7301
19 deferred acquisition costs 2228246
20 accounts receivable 3821124
21 goodwill 91692
22 deferred tax assets 466564
23 other assets 904253
24 accrued interest and rent 10052
25 assets held for sale 0
26 total assets 61196846
However some rows get doubled, especially the fixed income rows, which appear on top for the years 2021 until 2019 but then from 2018 get doubled on the bottom:
ITEM 2021 2020 2019 2018 2017 2016 2015 2014 2013
0 fixed-income securities - held to maturity 48632.0 185577.0 223049.0 NaN NaN NaN NaN NaN NaN
1 fixed-income securities - loans and receivables 2443629.0 2532146.0 2194064.0 NaN NaN NaN NaN NaN NaN
2 fixed-income securities - available for sale 45473677.0 38851723.0 38068459.0 NaN NaN NaN NaN NaN NaN
3 fixed-income securities - at fair value throug... 81308.0 105711.0 578779.0 NaN NaN NaN NaN NaN NaN
4 equity securities - available for sale 314453.0 378422.0 29215.0 NaN NaN NaN NaN NaN NaN
5 other financial assets - at fair value through... 248233.0 234689.0 235019.0 NaN NaN NaN NaN NaN NaN
6 investment property 1818754.0 1589238.0 1749517.0 1684932.0 NaN NaN NaN NaN NaN
7 real estate funds 805912.0 582296.0 534739.0 433899.0 NaN NaN NaN NaN NaN
8 investments in associated companies 238110.0 361617.0 245478.0 110545.0 121075.0 114633.0 128008.0 154822.0 144489.0
9 other invested assets 2941633.0 2794016.0 2211905.0 1805281.0 1761678.0 1764678.0 1544533.0 1316604.0 1023214.0
10 short-term investments 443793.0 327426.0 468350.0 421950.0 958669.0 838987.0 1113130.0 575300.0 549138.0
11 cash and cash equivalents 1355114.0 1278071.0 1090852.0 1072915.0 835706.0 848667.0 792604.0 772882.0 642936.0
12 total investments and cash under own management 56213248.0 49220932.0 47629426.0 42197338.0 40057478.0 41793495.0 39346903.0 36228010.0 31875242.0
13 funds withheld 10803071.0 9659807.0 10948469.0 10691768.0 10735012.0 11673259.0 13801845.0 15826480.0 14267831.0
14 contract deposits 503412.0 298344.0 325302.0 172873.0 167854.0 170505.0 188604.0 92069.0 75541.0
15 total investments 67519731.0 59179083.0 58903197.0 53061979.0 50960344.0 53637259.0 53337352.0 52146559.0 46218614.0
16 reinsurance recoverables on unpaid claims 2674107.0 1883270.0 2050114.0 2084630.0 1651335.0 1506292.0 1395281.0 1376432.0 1403804.0
17 reinsurance recoverables on benefit reserve 192039.0 192135.0 852598.0 909056.0 959533.0 1189420.0 1367173.0 676219.0 344154.0
18 prepaid reinsurance premium 204597.0 165916.0 116176.0 93678.0 96402.0 134927.0 164023.0 149257.0 139039.0
19 reinsurance recoverables on other technical re... 2703.0 1106.0 9355.0 7170.0 7301.0 12231.0 8687.0 5446.0 6893.0
20 deferred acquisition costs 3350633.0 2857071.0 2931722.0 2155820.0 2228246.0 2198089.0 2094671.0 1914598.0 1672398.0
21 accounts receivable 7207750.0 5605803.0 5269792.0 3975778.0 3821124.0 3678030.0 3665937.0 3113978.0 2945685.0
22 goodwill 83933.0 80965.0 88303.0 85588.0 91692.0 64609.0 60244.0 58220.0 57070.0
23 deferred tax assets 676344.0 597986.0 442469.0 454608.0 466564.0 408292.0 433500.0 393923.0 508841.0
24 other assets 972167.0 858170.0 640956.0 629420.0 904253.0 674389.0 680543.0 618280.0 603627.0
25 accrued interest and rent 18248.0 18264.0 15414.0 11726.0 10052.0 9978.0 7527.0 4672.0 4193.0
26 total assets 82902252.0 71439769.0 71356404.0 64508637.0 61196846.0 63528602.0 63214938.0 60457584.0 53915544.0
27 assets held for sale NaN 0.0 36308.0 1039184.0 0.0 15086.0 NaN 0.0 11226.0
28 fixed-income securities – held to maturity NaN NaN NaN 249943.0 336182.0 484955.0 1007665.0 2139742.0 2666787.0
29 fixed-income securities – loans and receivables NaN NaN NaN 2398950.0 2455164.0 2563594.0 2869865.0 2988187.0 3209100.0
30 fixed-income securities – available for sale NaN NaN NaN 33239685.0 31281908.0 32182173.0 29616448.0 26817523.0 22409892.0
31 fixed-income securities – at fair value throug... NaN NaN NaN 559750.0 212042.0 239917.0 108982.0 64494.0 36061.0
32 equity securities – available for sale NaN NaN NaN 28729.0 37520.0 905307.0 452108.0 32804.0 28980.0
33 other financial assets – at fair value through... NaN NaN NaN 190759.0 88832.0 57665.0 39602.0 66394.0 70082.0
34 real estate and real estate funds NaN NaN NaN NaN 1968702.0 1792919.0 1673958.0 1299258.0 1094563.0
The merge is made by using outer, because some positions do not appear in each year and I'd like to make sure to collect every row from the left and right table:
df_assets_1 = df2021_assets.merge(
df2020_assets, how='outer', on='ITEM').merge(
df2019_assets, how='outer', on='ITEM').merge(
df2018_assets, how='outer', on='ITEM').merge(
df2017_assets, how='outer', on='ITEM').merge(
df2016_assets, how='outer', on='ITEM').merge(
df2015_assets, how='outer', on='ITEM').merge(
df2014_assets, how='outer', on='ITEM').merge(
df2013_assets, how='outer', on='ITEM')
I tried to fix it by
- making sure that ITEM descriptions are stripped and all lowercased
- using .concat on axis=1 instead of .merge
- using left-join, but that cut the positions from the right table (as expected)
- try it without on='ITEM', but that did not make any difference
Any help on that is very appreciated - many thanks for your valuable time!
CodePudding user response:
I checked the process of joining by using M in PowerQuery and it is necessary to clean the data after nearly each step, because some positions get doubled due to the nature of financial results (which may have an overarching 'title-row' (shareholders' value for example) and another row with the same name that holds the sum of shareholders' value).
Thus after each outer merge possible duplicates have to be removed before going on with the next outer merge. This caused the duplicates.