Home > database >  how to split pandas rows based on different character combination
how to split pandas rows based on different character combination

Time:04-05

I need a help

I will like to write a script in python where string stored in a row is split in different rows based on certain characters

I am trying to extract everything between "a." and ";" (and similarly between "b." and ";" and so on to different rows)

For example (data is stored in pandas dataframe)

c1 c2 c3
1 Sta a.xxx i. gh ii. kj ;b.yyy;
2 Danny a.xxx;b.y,c.z;


is split as

c1 c2 c3
1 Sta xxx i. gh ii. kj
2 Sta yyy
3 Danny xxx
4 Danny y
5 Danny z

CodePudding user response:

You can use the solution below, which uses regex to parse the string and grab all data between semi colons

# set up dataframe
data = {
    "c1":  [1, 2],
    "c2": ["Sta", "Danny"],
    "c3": ["a.xxx;b.yyy;", "a.xxx;b.y;c.z;"]
}
df = pd.DataFrame(data)

# parse string column
df["c4"] = df["c3"].str.findall(r"\w.([^;] )")

# use explode to create a new record for each item found in c3
df = df.explode("c4")

CodePudding user response:

You can use str.extractall with the \.(?P<c3>[^;,] );? regex and join:

df2 = (df.drop(columns='c3')
         .join(df['c3'].str.extractall(r'\.(?P<c3>[^;,] );?')
                       .droplevel(1))
      )

output:

   c1     c2   c3
0   1    Sta  xxx
0   1    Sta  yyy
1   2  Danny  xxx
1   2  Danny    y
1   2  Danny    z
only after a|b
df2 = (df.drop(columns='c3')
         .join(df['c3'].str.extractall(r'(?:a|b)\.(?P<c3>[^;,] );?')
                       .droplevel(1))
      )

output:

   c1     c2                 c3
0   1    Sta  xxx i. gh ii. kj 
0   1    Sta                yyy
1   2  Danny                xxx
1   2  Danny                  y
  • Related