I need a help
I will like to write a script in python where string stored in a row is split in different rows based on certain characters
I am trying to extract everything between "a." and ";" (and similarly between "b." and ";" and so on to different rows)
For example (data is stored in pandas dataframe)
c1 | c2 | c3 |
---|---|---|
1 | Sta | a.xxx i. gh ii. kj ;b.yyy; |
2 | Danny | a.xxx;b.y,c.z; |
is split as
c1 | c2 | c3 |
---|---|---|
1 | Sta | xxx i. gh ii. kj |
2 | Sta | yyy |
3 | Danny | xxx |
4 | Danny | y |
5 | Danny | z |
CodePudding user response:
You can use the solution below, which uses regex to parse the string and grab all data between semi colons
# set up dataframe
data = {
"c1": [1, 2],
"c2": ["Sta", "Danny"],
"c3": ["a.xxx;b.yyy;", "a.xxx;b.y;c.z;"]
}
df = pd.DataFrame(data)
# parse string column
df["c4"] = df["c3"].str.findall(r"\w.([^;] )")
# use explode to create a new record for each item found in c3
df = df.explode("c4")
CodePudding user response:
You can use str.extractall
with the \.(?P<c3>[^;,] );?
regex and join
:
df2 = (df.drop(columns='c3')
.join(df['c3'].str.extractall(r'\.(?P<c3>[^;,] );?')
.droplevel(1))
)
output:
c1 c2 c3
0 1 Sta xxx
0 1 Sta yyy
1 2 Danny xxx
1 2 Danny y
1 2 Danny z
only after a|b
df2 = (df.drop(columns='c3')
.join(df['c3'].str.extractall(r'(?:a|b)\.(?P<c3>[^;,] );?')
.droplevel(1))
)
output:
c1 c2 c3
0 1 Sta xxx i. gh ii. kj
0 1 Sta yyy
1 2 Danny xxx
1 2 Danny y