Effective way to regexp match pandas and strip inside df?


Hoping someone on here is kind enough to at least point me in the right direction. Overall, I'm trying to run a regex match on each row and produce the output shown below (under 'desired example output').

To elaborate, data is being matched from a 'Device Pool' column from a rather large CSV (all settings from a phone). I need to:

  1. Keep only the regexp match in the Device Pool column for each row, while still keeping it paired with the corresponding Directory Number 1 data. I may add other columns later.
  2. Strip the D from the match as well, since it is only useful for the initial lookup.

Example input data (humongous file with lots of columns):

... ,Device Pool,CSS,forward 1,Directory Number 1, ...
YART01-432-D098-00-1,CSS-bobville-1,12223041234,12228675309
BART-1435-C512-00-1,CSS-willis-3,12223041234,12228215486
HOMER-1435-A134-00-1,CSS-willis-2,12223041238,12228212345
VAR05-1435-D099-00-1,CSS-willis-2,12223041897,12228215486
POA01-4-D100-00-1,CSS-hmbrgr-chz,12223043151,12228843454
...

Tried a few different approaches to no avail. With findall, I'd have to add the other columns back in, I guess (doesn't seem very efficient); it was pulling the matched data but not the other columns belonging to the row, so I dropped that direction. Surely there is a cleaner way, one that might even drop the need to filter first. This is where I'm at:

import pandas as pd

df1 = pd.read_csv('some_file.csv')
dff1 = df1.filter(items=['Device Pool', 'Directory Number 1'])
df2 = dff1.loc[dff1.iloc[:, 0].str.contains(r'[D][0-9][0-9][0-9]', regex=True)]
dff2 = # stuck here

current example output:

Device Pool Directory Number 1
YART01-432-D098-00-1 12228675309
VAR05-1435-D099-00-1 12228215486
POA01-4-D100-00-1 12228843454
...

desired example output:

Device Pool Directory Number 1
098 12228675309
099 12228215486
100 12228843454

...

I'll be using these trimmed numbers to reference an address-code CSV, then pulling coordinates via a geolocation lookup, to then map them. Pretty fun project, really.
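Roughly, that later lookup step would be something along these lines (address_codes.csv and its column names are just placeholders for whatever the real file uses, and 'trimmed' stands for the Device Pool / Directory Number 1 frame once I have the 3-digit codes):

import pandas as pd

codes = pd.read_csv('address_codes.csv', dtype=str)  # placeholder file with e.g. code, latitude, longitude columns
mapped = trimmed.merge(codes, left_on='Device Pool', right_on='code', how='left')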

CodePudding user response:

You can use

df['Device Pool'] = df['Device Pool'].str.replace(r'.*-D(\d+).*', r'\1', regex=True)

Or, with Series.str.extract:

df['Device Pool'] = df['Device Pool'].str.extract(r'-D(\d+)', expand=False)
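With expand=False and a single capture group, str.extract returns a Series rather than a one-column DataFrame, so it can be assigned straight back into the column; rows with no -D match come back as NaN.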

See a Pandas test:

import pandas as pd
df = pd.DataFrame({'Device Pool': ['YART01-432-D098-00-1', 'VAR05-1435-D099-00-1', 'POA01-4-D100-00-1'],
                   'Directory Number 1': ['12228675309', '12228215486', '12228843454']})
df['Device Pool'] = df['Device Pool'].str.replace(r'.*-D(\d+).*', r'\1', regex=True)
>>> df
  Device Pool Directory Number 1
0         098        12228675309
1         099        12228215486
2         100        12228843454

The .*-D(\d+).* regex matches

  • .* - any zero or more chars other than line break chars, as many as possible
  • -D - a literal -D string
  • (\d+) - Group 1: one or more digits
  • .* - the rest of the line.
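Tying it back to the column selection in the question, a minimal end-to-end sketch (using the extract variant and the question's column names; some_file.csv is the input from the question) could look like this:

import pandas as pd

df1 = pd.read_csv('some_file.csv', dtype=str)
dff1 = df1[['Device Pool', 'Directory Number 1']]

# keep only the rows whose Device Pool contains a D### code, then trim to just the digits
mask = dff1['Device Pool'].str.contains(r'D\d{3}', regex=True, na=False)
dff2 = dff1.loc[mask].copy()
dff2['Device Pool'] = dff2['Device Pool'].str.extract(r'-D(\d+)', expand=False)
print(dff2)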