Is there any way to concat multiple dataframes into one, but getting the name of each dataframe as index, or in the best case, get as a new column?
I have 17 dataframes and i want a identifier to know which data belongs to which dataframe.
i have these list of dataframes
frames = [fp_2015, fp_2016, fp_2017, fp_2018, fp_2019, fp_2020, res_FirmPrivs_2016, res_FirmPrivs_2017, res_FirmPrivs_2018, res_FirmPrivs_2019, res_FirmPrivs_2020, res_uai_2015, res_uai_2016,res_uai_2017,res_uai_2018, res_uai_2019, res_uai_2020]
I'd like to get something like this:
Dataframe_name | Institucion | filename | link |
---|---|---|---|
fp_2015. | CONACAFE. | RIA-CGR-001-15-AFYC-CONACAFE | index.php/repositorio/send/23-cgr/1977-ria-cg |
fp_2016 | CiudadSandino | RIA-CGR-001-15-AFYC-CONACAFE | /index.php/repositorio/send/23-cgr/1977-ria-cg.. |
I get these DF by extracting it from a website, i have it individually but documentation of pandas does not give a solution with concatenate function, just to assign it manually with "keys".
Thank you for your help
CodePudding user response:
Using simple example data, here's what you're trying to do:
from pandas import DataFrame, concat
dfs = {
'kids1': DataFrame([
{'id': 'A', 'name': 'Alice', 'age': 12},
{'id': 'B', 'name': 'Bob', 'age': 13}
]).set_index('id'),
'kids2': DataFrame([
{'id': 'B', 'name': 'Beau', 'age': 9},
{'id': 'C', 'name': 'Carla', 'age': 11}
]).set_index('id')
}
# just two regular dataframes
print(dfs['kids1'])
print(dfs['kids2'])
result = concat([
df.assign(source=name).reset_index().set_index(['source', 'id'])
for name, df in dfs.items()
], axis=0)
print(result)
Output:
name age
id
A Alice 12
B Bob 13
name age
id
B Beau 9
C Carla 11
name age
source id
kids1 A Alice 12
B Bob 13
kids2 B Beau 9
C Carla 11
And this now allows things like:
print(result.loc['kids1'])
print(result.loc[result.index.get_level_values(1)=='B']) # or:
print(result.loc[(slice(None), slice('B', 'B')), :])
Output:
name age
id
A Alice 12
B Bob 13
name age
source id
kids1 B Bob 13
kids2 B Beau 9
name age
source id
kids1 B Bob 13
kids2 B Beau 9
The result
is created from a dictionary of dataframes, to avoid dealing with variable names (which you shouldn't involve in your data), instead of a list of variables.
It's created using pd.concat()
, over axis=0
, i.e. combining the rows of the dataframes involved.
This part:
df.assign(source=name).reset_index().set_index(['source', 'id'])
Takes each of the dataframes, adds a column called 'source'
and sets it to the name (key) from the dictionary. It then resets the index (to get access to the id
columns) and sets the index to be a combination of the 'source'
and 'id'
columns.