I want to split dfPagesTotal['fullsection'],
which is currently a string like "/sport/tennis/us-open", into 3 new rows with
"/sport"
"/sport/tennis"
"/sport/tennis/us-open"
I tested something like this, but it only splits the string into its individual segments (and copies the other columns correctly); it does not keep the parent paths:
dfPagesTotal = dfPagesTotal.assign(fullsection=dfPagesTotal['fullsection'].str.split('/')).explode('fullsection')
Thanks
CodePudding user response:
If you want to split the paths in every single row, you might consider something like the following.
Data
import pandas as pd
df = pd.DataFrame({"fullsection":
                   ["/sport/tennis/us-open",
                    "/sport/tennis/roland-garros"]})
All parent paths
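Next we can define a function that generates all the parent paths:
def all_paths(lst):
    # lst is the path split on "/", so lst[0] is the empty string
    # that precedes the leading slash
    out = []
    for i, _ in enumerate(lst[2:]):
        # join the first 2 + i pieces: "/sport", "/sport/tennis", ...
        out.append("/".join(lst[:2 + i]))
    # finally add the full path itself
    out.append("/".join(lst))
    return out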
Apply and Explode
Finally we can apply our function and then explode. Here I prefer to add an extra column so we always know which one was the original path.
df["parents"] = df["fullsection"]\
    .str.split("/")\
    .apply(all_paths)
df = df.explode("parents")
And the output will be
                   fullsection                      parents
0        /sport/tennis/us-open                       /sport
0        /sport/tennis/us-open                /sport/tennis
0        /sport/tennis/us-open        /sport/tennis/us-open
1  /sport/tennis/roland-garros                       /sport
1  /sport/tennis/roland-garros                /sport/tennis
1  /sport/tennis/roland-garros  /sport/tennis/roland-garros
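Building on the apply-and-explode idea above, here is a minimal sketch that overwrites `fullsection` in place, as the question asked, instead of adding a second column. The helper name `prefixes` and the extra `views` column are made up for illustration, and the frame is a stand-in for the asker's `dfPagesTotal`.

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame; "views" is an invented extra column.
dfPagesTotal = pd.DataFrame({
    "fullsection": ["/sport/tennis/us-open", "/news"],
    "views": [10, 3],
})

def prefixes(path):
    """Return every cumulative prefix of a slash-separated path."""
    parts = [p for p in path.split("/") if p]  # drop empty segments
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]

result = (
    dfPagesTotal
    # replace each path string with the list of its cumulative prefixes
    .assign(fullsection=dfPagesTotal["fullsection"].map(prefixes))
    # one row per prefix; the other columns are repeated for each new row
    .explode("fullsection", ignore_index=True)
)
print(result)
```

Note that a path with no segments (for example "/") produces an empty list, which `explode` turns into a single row with a missing value, so such rows may need to be dropped afterwards.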
CodePudding user response:
This will help you.
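import pandas as pd
df = pd.DataFrame({"name": ["/sport/tennis/us-open"]})
ss = df.name[0].split('/')
ss.remove('')  # drop the empty string before the leading slash
new = []
pre = ''
for n in ss:
    # extend the previous path by one more segment
    pre = pre + "/" + n
    new.append(pre)
df1 = pd.DataFrame({"col": new})
df1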
output:
col
0 /sport
1 /sport/tennis
2 /sport/tennis/us-open
Note: this process looks a little hideous; there should be a simpler way.
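One possibly simpler way to write the same running concatenation is `itertools.accumulate`, which carries the previous result forward for you. This is only a sketch: the function name `cumulative_paths` is made up here, and the `initial` keyword requires Python 3.8 or later.

```python
from itertools import accumulate

def cumulative_paths(path):
    """Return "/a", "/a/b", "/a/b/c" for the input "/a/b/c"."""
    parts = [p for p in path.split("/") if p]  # drop empty segments
    # Start from "" and append "/segment" at each step; the first
    # yielded item is the initial "" itself, so it is sliced off.
    running = accumulate(parts, lambda acc, seg: acc + "/" + seg, initial="")
    return list(running)[1:]

print(cumulative_paths("/sport/tennis/us-open"))
```

The returned list can be fed to `pd.DataFrame({"col": ...})` exactly as in the loop version.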
CodePudding user response:
Another potential solution is this one:
# assumes df has a 'url' column holding the path strings
new_df = []
for i in range(3):
    # n=i limits the number of splits (n=0 means "split everywhere")
    dfPagesTotal = df.assign(fullsection=df['url'].str.split('/', n=i)).explode('fullsection')
    dfPagesTotal = dfPagesTotal[dfPagesTotal.fullsection != '']
    dfPagesTotal = dfPagesTotal.sort_values('fullsection')
    dfPagesTotal = dfPagesTotal.head(i)
    print(dfPagesTotal)
    new_df.append(dfPagesTotal)
new_df = pd.concat(new_df)
which gives:
url fullsection
0 /sport/tennis/us-open sport/tennis/us-open
0 /sport/tennis/us-open sport
0 /sport/tennis/us-open tennis/us-open