How to get all parent folders of a URL. Each into one line in pandas dataframe

Time:12-07

I want to split dfPagesTotal['fullsection'], which is currently a string like "/sport/tennis/us-open", into 3 new rows with

"/sport"
"/sport/tennis"
"/sport/tennis/us-open"

I tested something like this, but it just splits the string (and correctly copies the other columns); it does not keep the parent paths:

dfPagesTotal = dfPagesTotal.assign(fullsection=dfPagesTotal['fullsection'].str.split('/')).explode('fullsection')

Thanks

CodePudding user response:

If you want to split the paths in every single row, you might consider using something like this:

Data

import pandas as pd


df = pd.DataFrame({"fullsection": 
                   ["/sport/tennis/us-open", 
                    "/sport/tennis/roland-garros"]})

All parent paths

Next, we can define a function that generates all the parent paths:

def all_paths(lst):
    # lst comes from path.split("/"), so lst[0] is the empty string
    # that precedes the leading "/"
    out = []
    for i, l in enumerate(lst[2:]):
        out.append("/".join(lst[:2 + i]))
    out.append("/".join(lst))
    return out
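As a quick sanity check of the idea, here is a more compact sketch of the same slicing logic. The name `all_paths_compact` is made up for this illustration, and it assumes the input is the result of splitting a path that starts with "/":

```python
def all_paths_compact(parts):
    # parts is e.g. "/sport/tennis".split("/"), so parts[0] is "".
    # Joining the first 2, 3, ..., len(parts) items yields each parent path
    # followed by the full path.
    return ["/".join(parts[:end]) for end in range(2, len(parts) + 1)]


parts = "/sport/tennis/us-open".split("/")
print(all_paths_compact(parts))
# ['/sport', '/sport/tennis', '/sport/tennis/us-open']
```

It returns the same list as the loop version for this input, so either one can be passed to `.apply(...)`.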

Apply and Explode

Finally, we can apply our function and then explode. Here I prefer to add an extra column so we always know which one was the original path.

df["parents"] = df["fullsection"]\
    .str.split("/")\
    .apply(all_paths)

df = df.explode("parents")

And the output will be

                   fullsection                      parents
0        /sport/tennis/us-open                       /sport
0        /sport/tennis/us-open                /sport/tennis
0        /sport/tennis/us-open        /sport/tennis/us-open
1  /sport/tennis/roland-garros                       /sport
1  /sport/tennis/roland-garros                /sport/tennis
1  /sport/tennis/roland-garros  /sport/tennis/roland-garros

CodePudding user response:

This will help you.

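import pandas as pd

df = pd.DataFrame({"name": ["/sport/tennis/us-open"]})
ss = df.name[0].split('/')
ss.remove('')  # drop the empty string produced by the leading "/"
new = []
pre = ''
k = 0
for n in ss:
    new.append(pre + "/" + n)  # extend the previous path by one segment
    pre = new[k]
    k = k + 1
df1 = pd.DataFrame({"col": new})
df1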

output:

    col
0   /sport
1   /sport/tennis
2   /sport/tennis/us-open

Note: this looks like a rather clumsy process; there should be a simpler way.
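One possibly simpler way, offered as a sketch rather than a definitive answer, is `itertools.accumulate` from the standard library, which builds the running prefixes directly. The helper name `parent_paths` is hypothetical, and the sketch assumes the path starts with "/":

```python
from itertools import accumulate


def parent_paths(path):
    # Split into segments without the leading/trailing "/".
    parts = path.strip("/").split("/")
    # accumulate yields "sport", "sport/tennis", ... ; re-add the leading "/".
    return ["/" + p for p in accumulate(parts, lambda a, b: a + "/" + b)]


print(parent_paths("/sport/tennis/us-open"))
# ['/sport', '/sport/tennis', '/sport/tennis/us-open']
```

The resulting list can be handed to `pd.DataFrame({"col": ...})` in the same way as `new` above.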

CodePudding user response:

Another potential solution is this one:

# df is assumed to have a 'url' column holding the path
new_df = []
for i in range(3):
    # n=i caps the number of splits; pass it by keyword
    dfPagesTotal = df.assign(fullsection=df['url'].str.split('/', n=i)).explode('fullsection')
    dfPagesTotal = dfPagesTotal[dfPagesTotal.fullsection != '']
    dfPagesTotal = dfPagesTotal.sort_values('fullsection')
    dfPagesTotal = dfPagesTotal.head(i)
    print(dfPagesTotal)
    new_df.append(dfPagesTotal)

new_df = pd.concat(new_df)

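which gives the following (note that these are fragments of the path, not the cumulative parent paths from the question):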

                     url           fullsection
0  /sport/tennis/us-open  sport/tennis/us-open
0  /sport/tennis/us-open                 sport
0  /sport/tennis/us-open        tennis/us-open
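For comparison, here is a minimal sketch that does produce the cumulative parent paths from the question, using `pathlib.PurePosixPath` from the standard library. It assumes a `url` column as in the snippet above and absolute paths starting with "/"; the helper name `parents_of` is made up for this example:

```python
from pathlib import PurePosixPath

import pandas as pd


def parents_of(url_path):
    p = PurePosixPath(url_path)
    # p.parents runs from the nearest parent up to the root "/",
    # so reverse it, drop the root, and append the full path itself.
    chain = [str(x) for x in list(p.parents)[::-1]][1:]
    return chain + [str(p)]


df = pd.DataFrame({"url": ["/sport/tennis/us-open"]})
# One list of paths per row, then one row per list element.
new_df = df.assign(fullsection=df["url"].apply(parents_of)).explode("fullsection")
print(new_df)
```

Each original row is repeated once per parent path, with `url` keeping the original value.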