Extracting specific substring from a list of strings

I'm trying to consolidate the strings below, for lack of a better term. I would like to find every variant of "BB" (for example 'BB ' or 'BB-') and change it to "BB", every variant of "B" to "B", and likewise for all CCC variants, so that the categories are uniform. Below are all the unique values in the list:

array(['BB ', 'B-', 'BB', 'BB-', 'CCC ', 'N.A.', 'CCC', 'B ', 'B', None,
       'BB *-', 'B- * ', 'CC *-', 'B  *-', 'BB- *-', 'NR', 'CCC-', 'BBB-',
       'B  * ', 'BB  * ', 'B * ', 'BB * ', 'D', 'BB  *-', 'CCC  * ',
       'BBB', 'C *-', 'CCC  *-', 'BB- * ', 'CCC * ', 'FLD UNKNO',
       '(P)CCC ', '(P)B ', 'C', 'CCC- *-', 'BBB- *-'], dtype=object)

So, for example, every 'BB ' would be changed to 'BB'.

Likewise, values like 'BB * ' would be converted to just 'BB'.

I had been using

.str.find('BB').replace(0,'BB')

on the particular column of the dataframe, but the problem is that all of the B variants ('B ', 'B-', and plain 'B') and all of the CCC examples come back as -1, so I can't tell them apart from there.
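To illustrate the problem, here is a minimal sketch using plain Python's `str.find`, which pandas' `.str.find` applies to each element. The `samples` list is just a handful of values picked from the array above. `find` returns the index of the first match, or -1 when the substring is absent, so every non-BB string collapses to the same -1.

```python
# A few values from the array in the question
samples = ['BB ', 'B-', 'CCC', 'BBB-']

# str.find gives the position of the first 'BB', or -1 if it is absent
positions = {s: s.find('BB') for s in samples}

# 'BB ' -> 0, 'B-' -> -1, 'CCC' -> -1, 'BBB-' -> 0
print(positions)
```

Note that 'BBB-' also returns 0, so replacing 0 with 'BB' would mislabel BBB values as well.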

The ideal output would be:

'BB ', 'BB', 'BB-', and any variation of those with '* ', '*-', '(P)', etc. would output BB

'B ', 'B', 'B-', and any variation of those with '* ', '*-', '(P)', etc. would output B

'CCC ', 'CCC', 'CCC-', and any variation of those with '* ', '*-', '(P)', etc. would output CCC

CodePudding user response：

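# Assumes npArr holds the array from the question, created with dtype=str
for i in range(len(npArr)):
  # Check the longest label first: 'BB' and 'B' are also substrings of 'BBB'
  if 'BBB' in npArr[i]:
    npArr[i] = 'BBB'
  elif 'BB' in npArr[i]:
    npArr[i] = 'BB'
  elif 'B' in npArr[i]:
    npArr[i] = 'B'
  elif 'CCC' in npArr[i]:
    npArr[i] = 'CCC'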

I had to change a couple of things to make this work. First, I changed the dtype of the array to dtype=str, which also turns None into the string 'None'. I'm not sure whether that's a breaking change for you, but keeping dtype=object would require more work. Second, I assigned the array to the variable npArr.

Here is the output:

['BB' 'B' 'BB' 'BB' 'CCC' 'N.A.' 'CCC' 'B' 'B' 'None' 'BB' 'B' 'CC *-' 'B'
 'BB' 'NR' 'CCC' 'BBB' 'B' 'BB' 'B' 'BB' 'D' 'BB' 'CCC' 'BBB' 'C *-' 'CCC'
 'BB' 'CCC' 'FLD UNKNO' 'CCC' 'B' 'C' 'CCC' 'BBB']
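If converting the dtype is a problem, the same longest-first check could be wrapped in a small helper that skips non-string entries. This is only a sketch: the `collapse` function and the `ratings` sample are names made up for illustration, not part of the answer above.

```python
def collapse(value):
    # Leave None and other non-string entries untouched
    if not isinstance(value, str):
        return value
    # Check longest labels first so 'BBB' is not swallowed by 'BB' or 'B'
    for label in ('BBB', 'BB', 'B', 'CCC'):
        if label in value:
            return label
    # No known label found: keep the original value
    return value

# Hypothetical sample drawn from the values in the question
ratings = ['BB ', 'B- * ', 'BBB-', '(P)CCC ', 'CC *-', 'N.A.', None]
result = [collapse(r) for r in ratings]
print(result)
```

Because this is a plain substring test, any string that merely contains a 'B' would be mapped to 'B'. That is harmless for the values listed in the question, but worth keeping in mind for other data.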

CodePudding user response：

If lst is your list from the question, then this example changes B, BB and CCC values:

import re

pat = re.compile(r"\b(?:B{1,2}|CCC)\b")

out = []
for item in lst:
    if not isinstance(item, str):
        out.append(item)
        continue

    m = pat.search(item)
    if m:
        out.append(m.group(0))
    else:
        out.append(item)

for a, b in zip(out, lst):
    print(b, a, sep="\t")

Prints:

BB      BB
B-      B
BB      BB
BB-     BB
CCC     CCC
N.A.    N.A.
CCC     CCC
B       B
B       B
None    None
BB *-   BB
B- *    B
CC *-   CC *-
B  *-   B
BB- *-  BB
NR      NR
CCC-    CCC
BBB-    BBB-
B  *    B
BB  *   BB
B *     B
BB *    BB
D       D
BB  *-  BB
CCC  *  CCC
BBB     BBB
C *-    C *-
CCC  *- CCC
BB- *   BB
CCC *   CCC
FLD UNKNO       FLD UNKNO
(P)CCC  CCC
(P)B    B
C       C
CCC- *- CCC
BBB- *- BBB- *-
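As the last rows show, this pattern leaves BBB values untouched. If those should be collapsed too (an assumption, since the question only asks about B, BB and CCC), one possible variation is to widen the pattern to `B{1,3}` and wrap the logic in a function. The names `PAT`, `normalize` and `sample` are hypothetical.

```python
import re

# B{1,3} instead of B{1,2} so that 'BBB' variants are matched as well
PAT = re.compile(r"\b(?:B{1,3}|CCC)\b")

def normalize(item):
    # Pass None and other non-string values through unchanged
    if not isinstance(item, str):
        return item
    m = PAT.search(item)
    # Fall back to the original value when no rating label is found
    return m.group(0) if m else item

# Hypothetical sample drawn from the values in the question
sample = ['BBB- *-', 'BB  * ', '(P)B ', 'CCC-', 'CC *-', 'NR', None]
cleaned = [normalize(x) for x in sample]
print(cleaned)
```

On a DataFrame column, something like `df['rating'].map(normalize)` should apply the function element-wise, where `rating` is a made-up column name.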