I have a pandas dataframe of multiple columns. But the Column of interest say is Col A which looks like :
dfInput
A
12 - ksjksu,nsusi,9018,1.00uy,9.0Vm, / - 20%(0.22suns); 891- 1o19jsksuisolslskosBN
205 - lksiosslsoujhs%ysus(0.33mismsn);31 - jsks,msnu
22 - 0204 - 2762.0uhsyuskis;14 - gnshj,msuis,lsolso2.2n; 67181 - iuwuwiwi
So my goal is to split the strings into 2 parts one by the first '-' before/after the semicolon and then by the rest where the first column will have all the numbers on the left of first '-' and then the other column will have rest . So my output dataframe should look like:
**dfOutDesired**
B C
12 891 ksjksu,nsusi,9018,1.00uy,9.0Vm, / - 20%(0.22suns) 1o19jsksuisolslskosBN
205 31 lksiosslsoujhs%ysus(0.33mismsn) jsks,msnu
22 14 67181 0204 - 2762.0uhsyuskis gnshj,msuis,lsolso2.2n iuwuwiwi
So till now I have tried the following based on earlier posts I saw:
df['C'] = df["A"].str.replace(r'\w \s*-\s*', '', regex=True)
df['A'] = df['A'].str.replace(' ', '')
df['A'] = df['A'].str.replace('-', ' - ')
df["B"] = df["A"].str.replace(r'\s*;\s*\S ; \S ; \s*', ' ', regex=True).str.strip()
So I am getting partially correct output for Col C (where say 0204 in third row is not picked up) , but for Col B I am just getting the same Col A as follows:
**mydfWrong**
B
12 - ksjksu,nsusi,9018,1.00uy,9.0Vm, / - 20%(0.22suns); 891- 1o19jsksuisolslskosBN
205 - lksiosslsoujhs%ysus(0.33mismsn);31 - jsks,msnu
22 - 0204 - 2762.0uhsyuskis;14 - gnshj,msuis,lsolso2.2n; 67181 - iuwuwiwi
C
ksjksu,nsusi,9018,1.00uy,9.0Vm, / - 20%(0.22suns) 1o19jsksuisolslskosBN
lksiosslsoujhs%ysus(0.33mismsn) jsks,msnu
2762.0uhsyuskis gnshj,msuis,lsolso2.2n iuwuwiwi
Not sure how to use the regex properly in this case. So any help will be immensely appreciated. Thanks in advance.
CodePudding user response:
You can do that with something like the following.
The get_nums
function splits on the ;
, then removes the integers based on the specified conditions
The get_the_rest
function does the opposite, and substitutes those values with empty spaces
def get_nums(x):
return " ".join([re.search(r'^[0-9] ', y.strip()).group() for y in x.split(";")])
def get_the_rest(x):
return "".join([re.sub(r'[0-9] \s?\-', '', y.strip()) for y in x.split(";")])
df["B"] = df["A"].apply(lambda x: get_nums(x))
df["C"] = df["A"].apply(lambda x: get_the_rest(x))
CodePudding user response:
You can use a single regex call to findall
and then grab only the necessary data:
import pandas as pd
df = pd.DataFrame({'A': ['205 - lksiosslsoujhs%ysus(0.33mismsn);31 - jsks,msnu', '22 - 0204 - 2762.0uhsyuskis;14 - gnshj,msuis,lsolso2.2n; 67181 - iuwuwiwi']})
rx = r'(\d )\s*-\s*(.*?)(?=;\s*\d|$)' # Define the regex
m = df['A'].str.findall(rx) # Get all matches and captures
df["B"] = m.apply(lambda x: " ".join(c[0] for c in x)) # Join Group 1 values into Column B
df["C"] = m.apply(lambda x: " ".join(c[1] for c in x)) # Join Group 1 values into Column C
Output:
>>> print(df.to_string())
A B C
0 205 - lksiosslsoujhs%ysus(0.33mismsn);31 - jsks,msnu 205 31 lksiosslsoujhs%ysus(0.33mismsn) jsks,msnu
1 22 - 0204 - 2762.0uhsyuskis;14 - gnshj,msuis,lsolso2.2n; 67181 - iuwuwiwi 22 14 67181 0204 - 2762.0uhsyuskis gnshj,msuis,lsolso2.2n iuwuwiwi
See the regex demo that matches
(\d )
- Group 1: one or more digits\s*-\s*
- a-
enclosed with zero or more whitespaces(.*?)
- Group 2: any zero or more chars other than line break chars as few as possible(?=;\s*\d|$)
- positive lookahead that matches a location that is immediately followed with;
zero or more whitespaces and then a digit, or end of string.