Home > other >  Why rows get doubled when merging multiple frames with pandas in python
Why rows get doubled when merging multiple frames with pandas in python

Time:05-27

I'm trying to merge multiple year end results (2013 until 2021) which I read with pandas from excel - see those three years as an example, produced with the following code:

HNRrawfile_2018 = HNR_source_path 'HNR_AR2018_XLS_en.xlsx'

df2018_assets = pd.read_excel(
    HNRrawfile_2018,
    sheet_name=assets,
    skiprows=4,
    skipfooter=1,
    usecols=[0, 2])
df2018_assets.columns = ['ITEM', '2018']
df2018_assets['2018'] = df2018_assets['2018'].astype(int)
df2018_assets['ITEM'] = df2018_assets['ITEM'].str.strip().str.lower()
print(df2018_assets)
                                                 ITEM      2019
0          fixed-income securities - held to maturity    223049
1     fixed-income securities - loans and receivables   2194064
2        fixed-income securities - available for sale  38068459
3   fixed-income securities - at fair value throug...    578779
4              equity securities - available for sale     29215
5   other financial assets - at fair value through...    235019
6                                 investment property   1749517
7                                   real estate funds    534739
8                 investments in associated companies    245478
9                               other invested assets   2211905
10                             short-term investments    468350
11                          cash and cash equivalents   1090852
12    total investments and cash under own management  47629426
13                                     funds withheld  10948469
14                                  contract deposits    325302
15                                  total investments  58903197
16          reinsurance recoverables on unpaid claims   2050114
17        reinsurance recoverables on benefit reserve    852598
18                        prepaid reinsurance premium    116176
19  reinsurance recoverables on other technical re...      9355
20                         deferred acquisition costs   2931722
21                                accounts receivable   5269792
22                                           goodwill     88303
23                                deferred tax assets    442469
24                                       other assets    640956
25                          accrued interest and rent     15414
26                               assets held for sale     36308
27                                       total assets  71356404

                                                 ITEM      2018
0          fixed-income securities – held to maturity    249943
1     fixed-income securities – loans and receivables   2398950
2        fixed-income securities – available for sale  33239685
3   fixed-income securities – at fair value throug...    559750
4              equity securities – available for sale     28729
5   other financial assets – at fair value through...    190759
6                                 investment property   1684932
7                                   real estate funds    433899
8                 investments in associated companies    110545
9                               other invested assets   1805281
10                             short-term investments    421950
11                          cash and cash equivalents   1072915
12    total investments and cash under own management  42197338
13                                     funds withheld  10691768
14                                  contract deposits    172873
15                                  total investments  53061979
16          reinsurance recoverables on unpaid claims   2084630
17        reinsurance recoverables on benefit reserve    909056
18                        prepaid reinsurance premium     93678
19  reinsurance recoverables on other technical re...      7170
20                         deferred acquisition costs   2155820
21                                accounts receivable   3975778
22                                           goodwill     85588
23                                deferred tax assets    454608
24                                       other assets    629420
25                          accrued interest and rent     11726
26                               assets held for sale   1039184
27                                       total assets  64508637

                                                 ITEM      2017
0          fixed-income securities – held to maturity    336182
1     fixed-income securities – loans and receivables   2455164
2        fixed-income securities – available for sale  31281908
3   fixed-income securities – at fair value throug...    212042
4              equity securities – available for sale     37520
5   other financial assets – at fair value through...     88832
6                   real estate and real estate funds   1968702
7                 investments in associated companies    121075
8                               other invested assets   1761678
9                              short-term investments    958669
10                          cash and cash equivalents    835706
11    total investments and cash under own management  40057478
12                                     funds withheld  10735012
13                                  contract deposits    167854
14                                  total investments  50960344
15          reinsurance recoverables on unpaid claims   1651335
16        reinsurance recoverables on benefit reserve    959533
17                        prepaid reinsurance premium     96402
18  reinsurance recoverables on other technical re...      7301
19                         deferred acquisition costs   2228246
20                                accounts receivable   3821124
21                                           goodwill     91692
22                                deferred tax assets    466564
23                                       other assets    904253
24                          accrued interest and rent     10052
25                               assets held for sale         0
26                                       total assets  61196846

However some rows get doubled, especially the fixed income rows, which appear on top for the years 2021 until 2019 but then from 2018 get doubled on the bottom:

                                                 ITEM        2021        2020        2019        2018        2017        2016        2015        2014        2013
0          fixed-income securities - held to maturity     48632.0    185577.0    223049.0         NaN         NaN         NaN         NaN         NaN         NaN
1     fixed-income securities - loans and receivables   2443629.0   2532146.0   2194064.0         NaN         NaN         NaN         NaN         NaN         NaN
2        fixed-income securities - available for sale  45473677.0  38851723.0  38068459.0         NaN         NaN         NaN         NaN         NaN         NaN
3   fixed-income securities - at fair value throug...     81308.0    105711.0    578779.0         NaN         NaN         NaN         NaN         NaN         NaN
4              equity securities - available for sale    314453.0    378422.0     29215.0         NaN         NaN         NaN         NaN         NaN         NaN
5   other financial assets - at fair value through...    248233.0    234689.0    235019.0         NaN         NaN         NaN         NaN         NaN         NaN
6                                 investment property   1818754.0   1589238.0   1749517.0   1684932.0         NaN         NaN         NaN         NaN         NaN
7                                   real estate funds    805912.0    582296.0    534739.0    433899.0         NaN         NaN         NaN         NaN         NaN
8                 investments in associated companies    238110.0    361617.0    245478.0    110545.0    121075.0    114633.0    128008.0    154822.0    144489.0
9                               other invested assets   2941633.0   2794016.0   2211905.0   1805281.0   1761678.0   1764678.0   1544533.0   1316604.0   1023214.0
10                             short-term investments    443793.0    327426.0    468350.0    421950.0    958669.0    838987.0   1113130.0    575300.0    549138.0
11                          cash and cash equivalents   1355114.0   1278071.0   1090852.0   1072915.0    835706.0    848667.0    792604.0    772882.0    642936.0
12    total investments and cash under own management  56213248.0  49220932.0  47629426.0  42197338.0  40057478.0  41793495.0  39346903.0  36228010.0  31875242.0
13                                     funds withheld  10803071.0   9659807.0  10948469.0  10691768.0  10735012.0  11673259.0  13801845.0  15826480.0  14267831.0
14                                  contract deposits    503412.0    298344.0    325302.0    172873.0    167854.0    170505.0    188604.0     92069.0     75541.0
15                                  total investments  67519731.0  59179083.0  58903197.0  53061979.0  50960344.0  53637259.0  53337352.0  52146559.0  46218614.0
16          reinsurance recoverables on unpaid claims   2674107.0   1883270.0   2050114.0   2084630.0   1651335.0   1506292.0   1395281.0   1376432.0   1403804.0
17        reinsurance recoverables on benefit reserve    192039.0    192135.0    852598.0    909056.0    959533.0   1189420.0   1367173.0    676219.0    344154.0
18                        prepaid reinsurance premium    204597.0    165916.0    116176.0     93678.0     96402.0    134927.0    164023.0    149257.0    139039.0
19  reinsurance recoverables on other technical re...      2703.0      1106.0      9355.0      7170.0      7301.0     12231.0      8687.0      5446.0      6893.0
20                         deferred acquisition costs   3350633.0   2857071.0   2931722.0   2155820.0   2228246.0   2198089.0   2094671.0   1914598.0   1672398.0
21                                accounts receivable   7207750.0   5605803.0   5269792.0   3975778.0   3821124.0   3678030.0   3665937.0   3113978.0   2945685.0
22                                           goodwill     83933.0     80965.0     88303.0     85588.0     91692.0     64609.0     60244.0     58220.0     57070.0
23                                deferred tax assets    676344.0    597986.0    442469.0    454608.0    466564.0    408292.0    433500.0    393923.0    508841.0
24                                       other assets    972167.0    858170.0    640956.0    629420.0    904253.0    674389.0    680543.0    618280.0    603627.0
25                          accrued interest and rent     18248.0     18264.0     15414.0     11726.0     10052.0      9978.0      7527.0      4672.0      4193.0
26                                       total assets  82902252.0  71439769.0  71356404.0  64508637.0  61196846.0  63528602.0  63214938.0  60457584.0  53915544.0
27                               assets held for sale         NaN         0.0     36308.0   1039184.0         0.0     15086.0         NaN         0.0     11226.0
28         fixed-income securities – held to maturity         NaN         NaN         NaN    249943.0    336182.0    484955.0   1007665.0   2139742.0   2666787.0
29    fixed-income securities – loans and receivables         NaN         NaN         NaN   2398950.0   2455164.0   2563594.0   2869865.0   2988187.0   3209100.0
30       fixed-income securities – available for sale         NaN         NaN         NaN  33239685.0  31281908.0  32182173.0  29616448.0  26817523.0  22409892.0
31  fixed-income securities – at fair value throug...         NaN         NaN         NaN    559750.0    212042.0    239917.0    108982.0     64494.0     36061.0
32             equity securities – available for sale         NaN         NaN         NaN     28729.0     37520.0    905307.0    452108.0     32804.0     28980.0
33  other financial assets – at fair value through...         NaN         NaN         NaN    190759.0     88832.0     57665.0     39602.0     66394.0     70082.0
34                  real estate and real estate funds         NaN         NaN         NaN         NaN   1968702.0   1792919.0   1673958.0   1299258.0   1094563.0

The merge is made by using outer, because some positions do not appear in each year and I'd like to make sure to collect every row from the left and right table:

df_assets_1 = df2021_assets.merge(
    df2020_assets, how='outer', on='ITEM').merge(
    df2019_assets, how='outer', on='ITEM').merge(
    df2018_assets, how='outer', on='ITEM').merge(
    df2017_assets, how='outer', on='ITEM').merge(
    df2016_assets, how='outer', on='ITEM').merge(
    df2015_assets, how='outer', on='ITEM').merge(
    df2014_assets, how='outer', on='ITEM').merge(
    df2013_assets, how='outer', on='ITEM')

I tried to fix it by

  • making sure that ITEM descriptions are stripped and all lowercased
  • using .concat on axis=1 instead of .merge
  • using left-join, but that cut the positions from the right table (as expected)
  • try it without on='ITEM', but that did not make any difference

Any help on that is very appreciated - many thanks for your valuable time!

CodePudding user response:

I checked the process of joining by using M in PowerQuery and it is necessary to clean the data after nearly each step, because some positions get doubled due to the nature of financial results (which may have an overarching 'title-row' (shareholders' value for example) and another row with the same name that holds the sum of shareholders' value).

Thus after each outer merge possible duplicates have to be removed before going on with the next outer merge. This caused the duplicates.

  • Related