Home > Back-end >  Concat multiple dataframes in python with names as index
Concat multiple dataframes in python with names as index

Time:10-01

Is there any way to concat multiple dataframes into one, but getting the name of each dataframe as index, or in the best case, get as a new column?

I have 17 dataframes and i want a identifier to know which data belongs to which dataframe.

i have these list of dataframes

frames = [fp_2015, fp_2016, fp_2017, fp_2018, fp_2019, fp_2020, res_FirmPrivs_2016, res_FirmPrivs_2017, res_FirmPrivs_2018, res_FirmPrivs_2019, res_FirmPrivs_2020, res_uai_2015, res_uai_2016,res_uai_2017,res_uai_2018, res_uai_2019, res_uai_2020]

I'd like to get something like this:

Dataframe_name Institucion filename link
fp_2015. CONACAFE. RIA-CGR-001-15-AFYC-CONACAFE index.php/repositorio/send/23-cgr/1977-ria-cg
fp_2016 CiudadSandino RIA-CGR-001-15-AFYC-CONACAFE /index.php/repositorio/send/23-cgr/1977-ria-cg..

I get these DF by extracting it from a website, i have it individually but documentation of pandas does not give a solution with concatenate function, just to assign it manually with "keys".

Thank you for your help

CodePudding user response:

Using simple example data, here's what you're trying to do:

from pandas import DataFrame, concat

dfs = {
    'kids1': DataFrame([
        {'id': 'A', 'name': 'Alice', 'age': 12},
        {'id': 'B', 'name': 'Bob', 'age': 13}
    ]).set_index('id'),
    'kids2': DataFrame([
        {'id': 'B', 'name': 'Beau', 'age': 9},
        {'id': 'C', 'name': 'Carla', 'age': 11}
    ]).set_index('id')
}

# just two regular dataframes
print(dfs['kids1'])
print(dfs['kids2'])

result = concat([
    df.assign(source=name).reset_index().set_index(['source', 'id']) 
    for name, df in dfs.items()
], axis=0)

print(result)

Output:

     name  age
id            
A   Alice   12
B     Bob   13

     name  age
id            
B    Beau    9
C   Carla   11

            name  age
source id            
kids1  A   Alice   12
       B     Bob   13
kids2  B    Beau    9
       C   Carla   11

And this now allows things like:

print(result.loc['kids1'])

print(result.loc[result.index.get_level_values(1)=='B'])  # or:
print(result.loc[(slice(None), slice('B', 'B')), :])

Output:

     name  age
id            
A   Alice   12
B     Bob   13

           name  age
source id           
kids1  B    Bob   13
kids2  B   Beau    9

           name  age
source id           
kids1  B    Bob   13
kids2  B   Beau    9

The result is created from a dictionary of dataframes, to avoid dealing with variable names (which you shouldn't involve in your data), instead of a list of variables.

It's created using pd.concat(), over axis=0, i.e. combining the rows of the dataframes involved.

This part:

df.assign(source=name).reset_index().set_index(['source', 'id'])

Takes each of the dataframes, adds a column called 'source' and sets it to the name (key) from the dictionary. It then resets the index (to get access to the id columns) and sets the index to be a combination of the 'source' and 'id' columns.

  • Related