i`m trying to merge a column from another .xlsx file to another file.
This is the column(repoPath) that i want to add from another file:
And this is the file where i want to add it:
And this is my script:
from svnscripts.timestampdirectory import createdir,path_dir
import os
import time
def gitrepolastchangedate():
pathdest=path_dir()
dest=createdir()
timestr = time.strftime("%Y-%m-%d")
#below you read the file from the declared path
###processing FOR "TCDD"
df = pd.read_csv(rf"{pathdest}\{timestr}-rawGitData-reposLastChangeDate-tcdd.csv", sep='|', header=None)
dfGitReposSize=pd.read_excel(os.path.join(dest,"GitReposSize.xlsx"))
df.columns = ['repoPath']
df[['repoPath', 'repoLastChangeDate']] = df["repoPath"].str.split(";", 1, expand=True)
df["repoLastChangeDate"] = df["repoLastChangeDate"].str.strip("-")
df.to_excel(os.path.join(dest, "GitRepoLastChangeDate.xlsx"), index=False)
Result = pd.merge(
df,
dfGitReposSize[
[
'repoPath',
]
],
left_on='repoPath',
right_on='repoPath'
)
Result.fillna("N/A", inplace=True)
print(Result)
#Result.to_excel(os.path.join(dest, "GitRepoLastChangeDate.xlsx"), index=False)
gitrepolastchangedate()
The idea is that im trying to add the 'repoPath' from the other file comparing the keys, but idea is that the keys isn`t exactly the same. Because on file that i want to add the keys contains in the begining that : /srv/git key
So what im thinking its the right key should contains a str{/srv/git} key or idk..
The output that i receive is: Its not even creating the 3rd column...
Example for both file.xlsx:
file1:
repoName repoSize repoPath
./m4r_media-sync.git 812K tcdd/m4r_media-sync.git
./m4r_isd-software.git 538M tcdd/m4r_isd-software.git
./ext_oam-updater.git 2.5M tcdd/ext_oam-updater.git
./build.git 9.8M tcdd/build.git
./m4r_isd-adminapp.git 3.5M tcdd/m4r_isd-adminapp.git
./m4r_moving-map.git 4.5M tcdd/m4r_moving-map.git
./m4r_inet-portal.git 332K tcdd/m4r_inet-portal.git
FILE WHERE TO ADD:
repoPath repoLastChangeDate
/srv/git/tcdd/build.git 05/23/2022
/srv/git/tcdd/ext_apache2.2-fix.git 10/18/2016
/srv/git/tcdd/ext_oam-updater.git 05/19/2022
/srv/git/tcdd/m4r_cctv-trx.git 07/28/2017
/srv/git/tcdd/m4r_ffmpeg-tcdd.git 08/04/2016
/srv/git/tcdd/m4r_fleetstat.git 09/11/2020
so the column "repoPath" from file 1 needs to be add on file 2 on column C having the name (repoName) for the coresponding row of this file
CodePudding user response:
Initialize the dataframes
df1 = pd.DataFrame([
["./m4r_media-sync.git", "812K", "tcdd/m4r_media-sync.git"],
["./m4r_isd-software.git", "538M" , "tcdd/m4r_isd-software.git"],
["./ext_oam-updater.git", "2.5M" , "tcdd/ext_oam-updater.git"],
["./build.git", "9.8M" "tcdd/build.git"],
["./m4r_isd-adminapp.git", "3.5M" , "tcdd/m4r_isd-adminapp.git"],
["./m4r_moving-map.git", "4.5M" ,"tcdd/m4r_moving-map.git"],
["./m4r_inet-portal.git", "332K" , "tcdd/m4r_inet-portal.git"],
], columns=["repoName" , "repoSize" , "repoPath"])
df2 = pd.DataFrame([
["/srv/git/tcdd/build.git", "05/23/2022"],
["/srv/git/tcdd/ext_apache2.2-fix.git", "10/18/2016"],
["/srv/git/tcdd/ext_oam-updater.git", " 05/19/2022"],
["/srv/git/tcdd/m4r_cctv-trx.git", " 07/28/2017"],
["/srv/git/tcdd/m4r_ffmpeg-tcdd.git", " 08/04/2016"],
["/srv/git/tcdd/m4r_fleetstat.git", "09/11/2020"]
], columns=["repoPath" , "repoLastChangeDate"])
Create a new column for merge
df2["repoName"] = str("./") df2.repoPath.str.split("/").str[-1]
Merge both dataframes based on repoName
result_df = df1.merge(df2, on=["repoName"], how="inner")
Rename columns
result_df.rename({"repoPath_y" : "repoPath", "repoPath_x": "repoName"}, axis=1, inplace=True)
Extract output:
print(result_df[["repoPath", "repoName", "repoLastChangeDate"]])
Sample output:
repoPath repoName repoName repoLastChangeDate
0 /srv/git/tcdd/ext_oam-updater.git ./ext_oam-updater.git tcdd/ext_oam-updater.git 05/19/2022
1 /srv/git/tcdd/build.git ./build.git None 05/23/2022
CodePudding user response:
As I see all your rows for repoPath
in the first dataframe contains prefix /srv/git
. We can merge both the dataframe on repoPath
excluding the prefix /srv/git/
df1 = pd.read_excel(r"GitReposSize.xlsx")
df2 = pd.read_excel(r"GitRepoLastChangeDate.xlsx")
df1['newrepoPath'] = df1['repoPath'].map('/srv/git/{}'.format)
df1 = pd.merge(df1,df2, left_on='newrepoPath',right_on='repoPath', how='left').drop(['newrepoPath'], axis=1)
Saving as excel
df1.to_excel(r'FinalResult.xlsx', index = False)