Home > Blockchain >  Pandas python merge columns
Pandas python merge columns

Time:07-20

i`m trying to merge a column from another .xlsx file to another file.

This is the column(repoPath) that i want to add from another file: repoPath from another file

And this is the file where i want to add it:

Where to add "column C"

And this is my script:

from svnscripts.timestampdirectory import  createdir,path_dir
import os
import time

def gitrepolastchangedate():
    pathdest=path_dir()
    dest=createdir()
    timestr = time.strftime("%Y-%m-%d")
    #below you read the file from the declared path

    ###processing FOR "TCDD"
    df = pd.read_csv(rf"{pathdest}\{timestr}-rawGitData-reposLastChangeDate-tcdd.csv", sep='|', header=None)
    dfGitReposSize=pd.read_excel(os.path.join(dest,"GitReposSize.xlsx"))
    df.columns = ['repoPath']
    df[['repoPath', 'repoLastChangeDate']] = df["repoPath"].str.split(";", 1, expand=True)
    df["repoLastChangeDate"] = df["repoLastChangeDate"].str.strip("-")
    df.to_excel(os.path.join(dest, "GitRepoLastChangeDate.xlsx"), index=False)
    Result = pd.merge(
        df,
        dfGitReposSize[
            [
                'repoPath',


            ]
        ],
        left_on='repoPath',
        right_on='repoPath'
    )
    Result.fillna("N/A", inplace=True)
    print(Result)
    #Result.to_excel(os.path.join(dest, "GitRepoLastChangeDate.xlsx"), index=False)
gitrepolastchangedate()

The idea is that im trying to add the 'repoPath' from the other file comparing the keys, but idea is that the keys isn`t exactly the same. Because on file that i want to add the keys contains in the begining that : /srv/git key

So what im thinking its the right key should contains a str{/srv/git} key or idk..

The output that i receive is: Its not even creating the 3rd column...

output

Example for both file.xlsx:

file1:

repoName             repoSize        repoPath
./m4r_media-sync.git    812K    tcdd/m4r_media-sync.git
./m4r_isd-software.git  538M    tcdd/m4r_isd-software.git
./ext_oam-updater.git   2.5M    tcdd/ext_oam-updater.git
./build.git 9.8M    tcdd/build.git
./m4r_isd-adminapp.git  3.5M    tcdd/m4r_isd-adminapp.git
./m4r_moving-map.git    4.5M    tcdd/m4r_moving-map.git
./m4r_inet-portal.git   332K    tcdd/m4r_inet-portal.git

FILE WHERE TO ADD:

repoPath    repoLastChangeDate
/srv/git/tcdd/build.git 05/23/2022
/srv/git/tcdd/ext_apache2.2-fix.git 10/18/2016
/srv/git/tcdd/ext_oam-updater.git   05/19/2022
/srv/git/tcdd/m4r_cctv-trx.git  07/28/2017
/srv/git/tcdd/m4r_ffmpeg-tcdd.git   08/04/2016
/srv/git/tcdd/m4r_fleetstat.git 09/11/2020

so the column "repoPath" from file 1 needs to be add on file 2 on column C having the name (repoName) for the coresponding row of this file

CodePudding user response:

Initialize the dataframes

df1 = pd.DataFrame([
["./m4r_media-sync.git",    "812K",    "tcdd/m4r_media-sync.git"],
["./m4r_isd-software.git",  "538M" ,   "tcdd/m4r_isd-software.git"],
["./ext_oam-updater.git",   "2.5M"  ,  "tcdd/ext_oam-updater.git"],
["./build.git", "9.8M"    "tcdd/build.git"],
["./m4r_isd-adminapp.git",  "3.5M"   , "tcdd/m4r_isd-adminapp.git"],
["./m4r_moving-map.git",    "4.5M"    ,"tcdd/m4r_moving-map.git"],
["./m4r_inet-portal.git",   "332K"  ,  "tcdd/m4r_inet-portal.git"],
    
], columns=["repoName" ,   "repoSize" ,   "repoPath"])

df2 = pd.DataFrame([

["/srv/git/tcdd/build.git", "05/23/2022"],
["/srv/git/tcdd/ext_apache2.2-fix.git", "10/18/2016"],
["/srv/git/tcdd/ext_oam-updater.git", "  05/19/2022"],
["/srv/git/tcdd/m4r_cctv-trx.git", " 07/28/2017"],
["/srv/git/tcdd/m4r_ffmpeg-tcdd.git", "  08/04/2016"],
["/srv/git/tcdd/m4r_fleetstat.git", "09/11/2020"]

], columns=["repoPath"   , "repoLastChangeDate"])

Create a new column for merge

df2["repoName"] = str("./")   df2.repoPath.str.split("/").str[-1]

Merge both dataframes based on repoName

result_df = df1.merge(df2, on=["repoName"], how="inner")

Rename columns

result_df.rename({"repoPath_y" : "repoPath", "repoPath_x": "repoName"}, axis=1, inplace=True)

Extract output:

print(result_df[["repoPath", "repoName", "repoLastChangeDate"]])

Sample output:

    repoPath    repoName    repoName    repoLastChangeDate
0   /srv/git/tcdd/ext_oam-updater.git   ./ext_oam-updater.git   tcdd/ext_oam-updater.git    05/19/2022
1   /srv/git/tcdd/build.git ./build.git None    05/23/2022

CodePudding user response:

As I see all your rows for repoPath in the first dataframe contains prefix /srv/git. We can merge both the dataframe on repoPath excluding the prefix /srv/git/

df1 = pd.read_excel(r"GitReposSize.xlsx")
df2 = pd.read_excel(r"GitRepoLastChangeDate.xlsx")
df1['newrepoPath'] = df1['repoPath'].map('/srv/git/{}'.format)
df1 = pd.merge(df1,df2, left_on='newrepoPath',right_on='repoPath', how='left').drop(['newrepoPath'], axis=1)

Saving as excel

df1.to_excel(r'FinalResult.xlsx', index = False)
  • Related