I have a lot of dataframes in the form of:
  name something else
0  nm1            sm1
1  nm2            sm2
2  nm3            sm3
3  nm4            sm4
4  nm5            sm5
5  nm6            sm6
I would like to combine them based on name, but only if they come from the same year. Which year a dataframe belongs to can be seen from its file name, which has the form "something_else//2014_some file name.csv". So if I have another file from 2014 that looks something like this:
  name something else2
0  nm1             lol1
1  nm2             lol2
2  nm3             lol3
3  nm4             lol4
4  nm5             lol5
5  nm6             lol6
it should return a merged form:
  name something else1 something else2
0  nm1             sm1            lol1
1  nm2             sm2            lol2
2  nm3             sm3            lol3
3  nm4             sm4            lol4
4  nm5             sm5            lol5
5  nm6             sm6            lol6
However, if a file is from a different year, the dataframes should be concatenated instead, producing something of the form:
   name something else1 something else2
0   nm1             sm1            lol1
1   nm2             sm2            lol2
2   nm3             sm3            lol3
3   nm4             sm4            lol4
4   nm5             sm5            lol5
5   nm6             sm6            lol6
7   nm7             bla             bla
8   nm8             bla             bla
9   nm9             bla             bla
10 nm10             bla             bla
11 nm11             bla             bla
12 nm12             bla             bla
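In other words, with two tiny made-up frames mirroring the examples above, the two cases I'm after are roughly:
import pandas as pd

# made-up frames mirroring the examples above
df_2014_a = pd.DataFrame({"name": ["nm1", "nm2"], "something else1": ["sm1", "sm2"]})
df_2014_b = pd.DataFrame({"name": ["nm1", "nm2"], "something else2": ["lol1", "lol2"]})
df_2015   = pd.DataFrame({"name": ["nm7", "nm8"],
                          "something else1": ["bla", "bla"],
                          "something else2": ["bla", "bla"]})

same_year = pd.merge(df_2014_a, df_2014_b, on="name")            # same year: columns side by side
other_year = pd.concat([same_year, df_2015], ignore_index=True)  # different year: rows stacked below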
Note that dataframes from different years have the same "something else" columns, just with different values in them. It would also be nice to add a column called year showing which year each dataframe corresponds to; for example, the first (merged-only) dataset would then look like:
  name  year something else1 something else2
0  nm1  2014             sm1            lol1
1  nm2  2014             sm2            lol2
2  nm3  2014             sm3            lol3
3  nm4  2014             sm4            lol4
4  nm5  2014             sm5            lol5
5  nm6  2014             sm6            lol6
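For the year column, I assume the year can be pulled straight out of the file name, e.g. something like this (just a sketch for a path of the form shown above):
import re

path = "something_else//2014_some file name.csv"
year = re.search(r"(\d{4})_", path).group(1)  # -> "2014"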
The code I have so far is:
# spatial_paths = list of all file names
# (the first element is "spatial_search_intensity//2004_spatial_diabetic ketoacidosis.csv")
df5 = pd.read_csv("directory in google drive" + str(spatial_paths[0]))
df5 = df5.set_index("Name")
df5

for s_path in spatial_paths:
    variable_name = re.findall(r"\d{4}_spatial_(.+)\.csv", s_path)
    year = re.findall(r"(\d{4})_spatial_.+\.csv", s_path)
    df_new = pd.read_csv("directory in google drive" + str(s_path))
    df_new = df_new.set_index("Name")
    df5 = pd.merge(df5, df_new, left_index=True, right_index=True)
df5
The code isn't good, because I have zero idea how to proceed from here.
CodePudding user response:
I think I understand what you want; maybe this will give you some ideas even if you have to tweak it. I created six files, two per year, with similar column names but different data. Comments are inline.
import glob
import re
import pandas as pd

# first get a list of all files that will need to be read
all_files = glob.glob("*_file*.csv")
# find all years for those files (in my case three years)
years = list(set(re.findall(r"(\d{4})_", ', '.join(all_files))))

# iterate over the years, concatenating each year's files to the right
# (instead of merging, assuming the rows and names are equivalent),
# then dedup the column names and add the year
dfyearslist = []
for year in years:
    # get the year's files in question
    yearly_files = glob.glob(year + "_file*.csv")
    # print(yearly_files)
    dflist = []
    for f in yearly_files:
        dft = pd.read_csv(f, sep=',')
        dflist.append(dft)
    df = pd.concat(dflist, axis=1)            # axis=1: horizontal
    df = df.loc[:, ~df.columns.duplicated()]  # drop the duplicated "name" columns
    df['year'] = year
    dfyearslist.append(df)

df_final = pd.concat(dfyearslist)             # default axis=0: vertical
print(df_final)
Output
My columns are named cola and colb for each of the three years.
  name cola colb year
0  nm1  sm1 lol1 2014
1  nm2  sm2 lol2 2014
2  nm3  sm3 lol3 2014
3  nm4  sm4 lol4 2014
4  nm5  sm5 lol5 2014
5  nm6  sm6 lol6 2014
0 nm10  sm1 lol1 2015
1 nm11  sm2 lol2 2015
2 nm12  sm3 lol3 2015
3 nm13  sm4 lol4 2015
4 nm14  sm5 lol5 2015
5 nm15  sm6 lol6 2015
0 nm20  sm1 lol1 2016
1 nm21  sm2 lol2 2016
2 nm22  sm3 lol3 2016
3 nm23  sm4 lol4 2016
4 nm24  sm5 lol5 2016
5 nm25  sm6 lol6 2016
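One caveat: the column-wise concat above assumes every file for a given year has the same rows in the same order. If that is not guaranteed, you could merge each year's files on name instead. A sketch of that variant, assuming the same hypothetical *_file*.csv naming and that every file has a name column:
import glob
import re
from functools import reduce
import pandas as pd

all_files = glob.glob("*_file*.csv")
years = sorted(set(re.findall(r"(\d{4})_", ', '.join(all_files))))

dfyearslist = []
for year in years:
    yearly = [pd.read_csv(f) for f in glob.glob(year + "_file*.csv")]
    # merge on "name" so rows are matched by value rather than by position
    df = reduce(lambda left, right: pd.merge(left, right, on="name"), yearly)
    df["year"] = year
    dfyearslist.append(df)

df_final = pd.concat(dfyearslist, ignore_index=True)
print(df_final)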
CodePudding user response:
Sounds like a "Power Query" task: load all the CSVs and choose "Transform". From there you can add the year (add a custom column), combine all files in a new query, and apply your logic.