Home > Mobile >  how to separate string from number into different positions in pandas?
how to separate string from number into different positions in pandas?

Time:12-15

I have this dataframe below, but the column "Column_2" have str and date together. How to separate that?

data = {'Column_1': ["foo","bar","foo","foo","bar"],
    'Column_2': ["PAID_202109_ASD","DENIED_202109_ASD", "202108_APPROVED_DSA", 
                 "AUTHORIZED_202107_qtd_DSA","202107_AUTHORIZED_ASD"]}

df = pd.DataFrame(data)

>>> df

enter image description here

expected result:

>>> df

enter image description here

CodePudding user response:

With extract and replace:

df["Data"] = df["Column_2"].str.extract("(\d )")
df["Str"] = df['Column_2'].str.replace("(\d_) "," ", regex=True).str.strip()

>>> df
 Column_1                   Column_2    Data                 Str
0      foo            PAID_202109_ASD  202109            PAID ASD
1      bar          DENIED_202109_ASD  202109          DENIED ASD
2      foo        202108_APPROVED_DSA  202108        APPROVED DSA
3      foo  AUTHORIZED_202107_qtd_DSA  202107  AUTHORIZED qtd DSA
4      bar      202107_AUTHORIZED_ASD  202107      AUTHORIZED ASD

CodePudding user response:

First explode your Column_2 then create your 2 columns:

out = df['Column_2'].str.split('_').explode()
mask = out.str.isdigit()  # or .str.fullmatch('\d{6}')
df['Data'] = out[mask]
df['Str'] = out[~mask].groupby(level=0).apply(' '.join)
print(df)

# Output:
  Column_1                   Column_2    Data                 Str
0      foo            PAID_202109_ASD  202109            PAID ASD
1      bar          DENIED_202109_ASD  202109          DENIED ASD
2      foo        202108_APPROVED_DSA  202108        APPROVED DSA
3      foo  AUTHORIZED_202107_qtd_DSA  202107  AUTHORIZED qtd DSA
4      bar      202107_AUTHORIZED_ASD  202107      AUTHORIZED ASD

I used str.isdigit() to determine if the row is a number or not. You can also use something like str.fullmatch('\d{6}').

CodePudding user response:

df["Str"] =df['Column_2'].str.replace('[\_]',' ', regex=True).str.replace('\d \s ','', regex=True)#First replace - with nothing and all digits followed by space with nothing
df["data"] =df['Column_2'].str.replace('[^\d]',' ', regex=True).str.strip()#Replace everything except digit



 Column_1                   Column_2                 Str    data
0      foo            PAID_202109_ASD            PAID ASD  202109
1      bar          DENIED_202109_ASD          DENIED ASD  202109
2      foo        202108_APPROVED_DSA        APPROVED DSA  202108
3      foo  AUTHORIZED_202107_qtd_DSA  AUTHORIZED qtd DSA  202107
4      bar      202107_AUTHORIZED_ASD      AUTHORIZED ASD  202107
  • Related