pandas merging and concatenating

I have a lot of dataframes in the form of:

  name    something else
0 nm1          sm1
1 nm2          sm2
2 nm3          sm3
3 nm4          sm4
4 nm5          sm5
5 nm6          sm6

I would like to combine them on name, but only if they are from the same year. The year can be read from the file name, which has the form "something_else//2014_some file name.csv". So if I have another file from 2014 that looks something like this:

     name      something else2
     0 nm1         lol1
     1 nm2         lol2
     2 nm3         lol3
     3 nm4         lol4
     4 nm5         lol5
     5 nm6         lol6

it should return a merged form:

      name    something else1  something else2
    0 nm1          sm1                 lol1
    1 nm2          sm2                 lol2
    2 nm3          sm3                 lol3
    3 nm4          sm4                 lol4
    4 nm5          sm5                 lol5
    5 nm6          sm6                 lol6
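
A minimal sketch of that same-year merge, assuming the key column is literally called "name" and using two small hypothetical frames in place of the real CSV files. The second frame is deliberately shuffled to show that `merge` matches rows by key rather than by position:

```python
import pandas as pd

# Two hypothetical same-year frames that share the "name" column.
left = pd.DataFrame({"name": ["nm1", "nm2", "nm3"],
                     "something else1": ["sm1", "sm2", "sm3"]})
right = pd.DataFrame({"name": ["nm3", "nm1", "nm2"],  # deliberately shuffled
                      "something else2": ["lol3", "lol1", "lol2"]})

# Join on the shared key; how="outer" keeps names that appear in only one file.
merged = left.merge(right, on="name", how="outer")
print(merged)
```

With `how="inner"` instead, names missing from either file would be dropped, so the choice depends on whether every file is expected to list every name.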

However, if a file is from a different year, its rows should be concatenated below, producing something of the form:

      name    something else1  something else2
    0 nm1          sm1                 lol1
    1 nm2          sm2                 lol2
    2 nm3          sm3                 lol3
    3 nm4          sm4                 lol4
    4 nm5          sm5                 lol5
    5 nm6          sm6                 lol6
    7 nm7          bla                 bla
    8 nm8          bla                 bla
    9 nm9          bla                 bla
    10 nm10        bla                 bla
    11 nm11        bla                 bla
    12 nm12        bla                 bla
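
The stacking of different years could be sketched like this, assuming each year has already been merged into one frame with identical columns. The two frames here are hypothetical stand-ins, not the real data:

```python
import pandas as pd

# Hypothetical per-year frames with the same columns but different rows.
frame_2014 = pd.DataFrame({"name": ["nm1", "nm2"],
                           "something else1": ["sm1", "sm2"],
                           "something else2": ["lol1", "lol2"]})
frame_2015 = pd.DataFrame({"name": ["nm7", "nm8"],
                           "something else1": ["bla", "bla"],
                           "something else2": ["bla", "bla"]})

# axis=0 (the default) stacks rows; ignore_index=True renumbers them 0..n-1.
stacked = pd.concat([frame_2014, frame_2015], ignore_index=True)
print(stacked)
```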

Note that dataframes from different years have the same "something else" columns, but of course different values in them. It would also be nice to have another column called year showing the year each dataframe corresponds to. For example, the first dataset (merged only) would look like:

  name  year    something else1  something else2
0 nm1   2014         sm1                 lol1
1 nm2   2014         sm2                 lol2
2 nm3   2014         sm3                 lol3
3 nm4   2014         sm4                 lol4
4 nm5   2014         sm5                 lol5
5 nm6   2014         sm6                 lol6
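
One way the year column might be produced is sketched below, assuming the file name always starts with a four-digit year followed by an underscore. `year_from_path` is a hypothetical helper, not part of pandas:

```python
import re
import pandas as pd

def year_from_path(path):
    """Return the four-digit year that starts the file name, or None."""
    file_name = re.split(r"[\\/]+", path)[-1]  # drop any directory part
    match = re.match(r"(\d{4})_", file_name)
    return match.group(1) if match else None

path = "something_else//2014_some file name.csv"
frame = pd.DataFrame({"name": ["nm1", "nm2"],
                      "something else1": ["sm1", "sm2"]})

# Insert the year as the second column; the scalar is repeated for every row.
frame.insert(1, "year", year_from_path(path))
print(frame)
```

The year is kept as a string here; wrapping it in `int(...)` is an option if it needs to be numeric.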

The code I have so far is:

import re
import pandas as pd

# spatial_paths = list of all file names
# (first element is "spatial_search_intensity//2004_spatial_diabetic ketoacidosis.csv")
df5 = pd.read_csv("directory in google drive" + str(spatial_paths[0]))
df5 = df5.set_index("Name")
df5
for s_path in spatial_paths:
  variable_name = re.findall(r"\d{4}_spatial_(.+)\.csv", s_path)
  year = re.findall(r"(\d{4})_spatial_.+\.csv", s_path)
  df_new = pd.read_csv("directory in google drive" + str(s_path))
  df_new = df_new.set_index("Name")
  df5 = pd.merge(df5, df_new, left_index=True, right_index=True)
df5

The code is not good because I have no idea how to proceed.

CodePudding user response：

I think I understand what you want; if not, this may at least give you some ideas to tweak. I created six files, two per year, with similar column names but different data. Comments are inline.

import glob
import re
import pandas as pd

# first get a list of all files that will need to be read
all_files = glob.glob("*_file*.csv")

# find all years for those files (in my case three years), sorted so the output is in year order
years = sorted(set(re.findall(r"(\d{4})_", ', '.join(all_files))))

# iterate over years, concatenating each file to the right (instead of merging, assuming the rows and names are equivalent), then dedup column names and add year
dfyearslist = []
for year in years:
    # get the year's files in question
    yearly_files = glob.glob(year + "_file*.csv")
    # print(yearly_files)
    dflist = []
    for f in yearly_files:
        dft = pd.read_csv(f, sep=',')
        dflist.append(dft)
    df = pd.concat(dflist,axis=1) #axis = 1, horizontal
    df = df.loc[:,~df.columns.duplicated()]
    df['year'] = year
    dfyearslist.append(df)
df_final = pd.concat(dfyearslist) # defaults axis = 0, vertical
print(df_final)

Output

My columns are named cola and colb for each of the three years:

   name cola  colb  year
0   nm1  sm1  lol1  2014
1   nm2  sm2  lol2  2014
2   nm3  sm3  lol3  2014
3   nm4  sm4  lol4  2014
4   nm5  sm5  lol5  2014
5   nm6  sm6  lol6  2014
0  nm10  sm1  lol1  2015
1  nm11  sm2  lol2  2015
2  nm12  sm3  lol3  2015
3  nm13  sm4  lol4  2015
4  nm14  sm5  lol5  2015
5  nm15  sm6  lol6  2015
0  nm20  sm1  lol1  2016
1  nm21  sm2  lol2  2016
2  nm22  sm3  lol3  2016
3  nm23  sm4  lol4  2016
4  nm24  sm5  lol5  2016
5  nm25  sm6  lol6  2016
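
Concatenating along axis=1 lines rows up by index position, so it only works if every file lists the same names in the same order. If that cannot be guaranteed, a variant that merges on the name column may be safer. The sketch below is only an illustration: `frames_by_file` is a hypothetical in-memory stand-in for reading each CSV, and the file and column names are assumptions modelled on the example above.

```python
from functools import reduce
import re
import pandas as pd

# Hypothetical stand-ins for pd.read_csv(path); the keys mimic the file names.
frames_by_file = {
    "2014_file_a.csv": pd.DataFrame({"name": ["nm1", "nm2"], "cola": ["sm1", "sm2"]}),
    "2014_file_b.csv": pd.DataFrame({"name": ["nm2", "nm1"], "colb": ["lol2", "lol1"]}),
    "2015_file_a.csv": pd.DataFrame({"name": ["nm10", "nm11"], "cola": ["sm1", "sm2"]}),
    "2015_file_b.csv": pd.DataFrame({"name": ["nm10", "nm11"], "colb": ["lol1", "lol2"]}),
}

# Group the frames by the year prefix of each file name.
frames_by_year = {}
for file_name, frame in frames_by_file.items():
    year = re.match(r"(\d{4})_", file_name).group(1)
    frames_by_year.setdefault(year, []).append(frame)

yearly = []
for year in sorted(frames_by_year):
    # Merge this year's frames on "name" so rows match by key, not by position.
    merged = reduce(lambda l, r: l.merge(r, on="name", how="outer"),
                    frames_by_year[year])
    merged.insert(1, "year", year)
    yearly.append(merged)

# Stack the per-year results vertically with a fresh 0..n-1 index.
df_final = pd.concat(yearly, ignore_index=True)
print(df_final)
```

Note that the second 2014 frame lists its names in reverse order, and the merge still pairs nm1 with lol1.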

CodePudding user response：

This sounds like a Power Query task: load all the CSVs and choose "Transform". From there you can add the years (add a custom column), combine all files in a new query, and apply your logic.