Home > Software engineering >  How to extract part of a string in Pandas column and make a new column
How to extract part of a string in Pandas column and make a new column

Time:11-12

I have the below pandas dataframe.

d = {'col1': [1, 2,3,4,5,60,0,0,6,3,2,4],'col3': [1, 22,33,44,55,60,1,5,6,3,2,4],'Name': ['2a df a1asd_V1', 'xcd a2asd_V3','23vg aabsd_V1','dfgdf_aabsd_V0','a3as  d_V1','aa bsd_V3','aasd_V4','aabsd_V4','aa_adn sd_V15',np.nan,'aasd_V12','aasd120Abs'],'Date': ['2021-06-13', '2021-06-13','2021-06-13','2021-06-14','2021-06-15','2021-06-15','2021-06-13','2021-06-16','2021-06-13','2021-06-13','2021-06-13','2021-06-16']}
dff = pd.DataFrame(data=d)
dff

    col1    col3    Name            Date       
0   1       1       2a df a1asd_V1  2021-06-13 
1   2       22      xcd a2asd_V3    2021-06-13 
2   3       33      23vg aabsd_V1   2021-06-13 
3   4       44      dfgdf_aabsd_V0  2021-06-14 
4   5       55      a3as d_V1       2021-06-15 
5   60      60      aa bsd_V3       2021-06-15 
6   0       1       aasd_V4         2021-06-13 
7   0       5       aabsd_V4        2021-06-16 
8   6       6       aa_adn sd_V10   2021-06-13 
9   3       3       NaN             2021-06-13 
10  2       2       aasd_V12        2021-06-13 
11  4       4       aasd120Abs      2021-06-16 

I want to make two new columns based on the Name column. I want to extract the part of the string in the Name column like V1, V2, V3, V4...V20 like that. Also if there isn't anything like that at end of the Name string or if the Name row is empty, just want to make an empty cell. So I want something like below pandas dataframe.

    col1    col3    Name            Date         Version        Version
0   1       1       2a df a1asd_V1  2021-06-13      V1          Version 1
1   2       22      xcd a2asd_V3    2021-06-13      V3          Version 3
2   3       33      23vg aabsd_V1   2021-06-13      V1          Version 1
3   4       44      dfgdf_aabsd_V0  2021-06-14      V0          Version 0
4   5       55      a3as d_V1       2021-06-15      V1          Version 1
5   60      60      aa bsd_V3       2021-06-15      V3          Version 3
6   0       1       aasd_V4         2021-06-13      V4          Version 4
7   0       5       aabsd_V4        2021-06-16      V4          Version 4
8   6       6       aa_adn sd_V10   2021-06-13      V10         Version 10
9   3       3       NaN             2021-06-13                  
10  2       2       aasd_V12        2021-06-13      V12         Version 12
11  4       4       aasd120Abs      2021-06-16      

Is it possible to do that? I know in SQL we can do that using "LIKE"

WHEN `Name` LIKE '%V10%' THEN 'Verison 10'.

Is there a similar command or any other way to do that in python?

Thanks in advance! Any help is appreciated!

CodePudding user response:

Use str.extract with a regex and str.replace to rename values:

dff['Version_short'] = dff['Name'].str.extract('_(V\d )$').fillna('')
dff['Version_long'] = dff['Version_short'].str.replace('V', 'Version ')

Output:

>>> dff
    col1  col3            Name        Date Version_short Version_long
0      1     1  2a df a1asd_V1  2021-06-13            V1    Version 1
1      2    22    xcd a2asd_V3  2021-06-13            V3    Version 3
2      3    33   23vg aabsd_V1  2021-06-13            V1    Version 1
3      4    44  dfgdf_aabsd_V0  2021-06-14            V0    Version 0
4      5    55      a3as  d_V1  2021-06-15            V1    Version 1
5     60    60       aa bsd_V3  2021-06-15            V3    Version 3
6      0     1         aasd_V4  2021-06-13            V4    Version 4
7      0     5        aabsd_V4  2021-06-16            V4    Version 4
8      6     6   aa_adn sd_V15  2021-06-13           V15   Version 15
9      3     3             NaN  2021-06-13                           
10     2     2        aasd_V12  2021-06-13           V12   Version 12
11     4     4      aasd120Abs  2021-06-16                           

CodePudding user response:

You can use str.extract and a short regex (_(V\d )$):

dff['Version'] = dff['Name'].str.extract('_(V\d )$')
dff['Version_long'] = 'Version ' dff['Version'].str[1:]

NB. I named the second column differently as it is problematic (although not fully impossible) to have two columns with the same name

output:

    col1  col3            Name        Date Version Version_long
0      1     1  2a df a1asd_V1  2021-06-13      V1    Version 1
1      2    22    xcd a2asd_V3  2021-06-13      V3    Version 3
2      3    33   23vg aabsd_V1  2021-06-13      V1    Version 1
3      4    44  dfgdf_aabsd_V0  2021-06-14      V0    Version 0
4      5    55      a3as  d_V1  2021-06-15      V1    Version 1
5     60    60       aa bsd_V3  2021-06-15      V3    Version 3
6      0     1         aasd_V4  2021-06-13      V4    Version 4
7      0     5        aabsd_V4  2021-06-16      V4    Version 4
8      6     6   aa_adn sd_V15  2021-06-13     V15   Version 15
9      3     3             NaN  2021-06-13     NaN          NaN
10     2     2        aasd_V12  2021-06-13     V12   Version 12
11     4     4      aasd120Abs  2021-06-16     NaN          NaN

CodePudding user response:

Use str.split and str.replace in df.assign in a one liner

  dff = dff.assign(Version_short=dff['Name'].str.split('([A-Z0-9] $)').str[-2].fillna(''),Version_lon=dff['Name'].str.split('([A-Z0-9] $)').str[-2].replace('^\D', 'Version ', regex=True).fillna(''))



       col1  col3            Name        Date Version_short Version_lon
0      1     1  2a df a1asd_V1  2021-06-13            V1   Version 1
1      2    22    xcd a2asd_V3  2021-06-13            V3   Version 3
2      3    33   23vg aabsd_V1  2021-06-13            V1   Version 1
3      4    44  dfgdf_aabsd_V0  2021-06-14            V0   Version 0
4      5    55      a3as  d_V1  2021-06-15            V1   Version 1
5     60    60       aa bsd_V3  2021-06-15            V3   Version 3
6      0     1         aasd_V4  2021-06-13            V4   Version 4
7      0     5        aabsd_V4  2021-06-16            V4   Version 4
8      6     6   aa_adn sd_V15  2021-06-13           V15  Version 15
9      3     3             NaN  2021-06-13                          
10     2     2        aasd_V12  2021-06-13           V12  Version 12
11     4     4      aasd120Abs  2021-06-16         

             
  • Related