I'm trying to consolidate the strings below so that the categories are uniform: every variant of "BB" should become "BB", every variant of "B" should become "B", and every variant of "CCC" should become "CCC". Below are all the unique values in the column:
array(['BB ', 'B-', 'BB', 'BB-', 'CCC ', 'N.A.', 'CCC', 'B ', 'B', None,
'BB *-', 'B- * ', 'CC *-', 'B *-', 'BB- *-', 'NR', 'CCC-', 'BBB-',
'B * ', 'BB * ', 'B * ', 'BB * ', 'D', 'BB *-', 'CCC * ',
'BBB', 'C *-', 'CCC *-', 'BB- * ', 'CCC * ', 'FLD UNKNO',
'(P)CCC ', '(P)B ', 'C', 'CCC- *-', 'BBB- *-'], dtype=object)
So for example, all 'BB ' would be changed to 'BB'
Likewise, values like 'BB * ' would be converted to just 'BB'
I had been using
.str.find('BB').replace(0,'BB')
on that column of the dataframe, but the problem is that all of the B /-/flat and CCC examples come back as -1, so I can't tell them apart from there.
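To illustrate the problem: `find` returns the index of the first occurrence of the substring, or -1 when it is absent. Below is a minimal sketch using plain Python `str.find`, which pandas' `.str.find` applies element by element; the sample values are taken from the list above.

```python
# str.find returns the position of the substring, or -1 if it is missing.
samples = ['BB ', 'B-', 'CCC ', 'BBB-', '(P)B ']
positions = {s: s.find('BB') for s in samples}

for value, pos in positions.items():
    print(repr(value), pos)
```

So a 0 only says that the string starts with 'BB' (which is also true of 'BBB-'), while every B and CCC value collapses to the same -1.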
The ideal output would be:
BB , BB, BB- and any variation of those with * , *-, (P), etc. would output BB
B , B, B- and any variation of those with * , *-, (P), etc. would output B
CCC , CCC, CCC- and any variation of those with * , *-, (P), etc. would output CCC
CodePudding user response:
# Check the longest pattern first, so 'BBB' is not caught by 'BB' or 'B'.
for i in range(len(npArr)):
    if 'BBB' in npArr[i]:
        npArr[i] = 'BBB'
    elif 'BB' in npArr[i]:
        npArr[i] = 'BB'
    elif 'B' in npArr[i]:
        npArr[i] = 'B'
    elif 'CCC' in npArr[i]:
        npArr[i] = 'CCC'
I had to change a couple of things to make this work. Firstly, I changed the dtype of the array to dtype=str (note that None then becomes the string 'None'). I'm not sure if that's a breaking change for you, but otherwise it would require more work. Secondly, I assigned the array to the variable npArr.
Here is the output:
['BB' 'B' 'BB' 'BB' 'CCC' 'N.A.' 'CCC' 'B' 'B' 'None' 'BB' 'B' 'CC *-' 'B'
'BB' 'NR' 'CCC' 'BBB' 'B' 'BB' 'B' 'BB' 'D' 'BB' 'CCC' 'BBB' 'C *-' 'CCC'
'BB' 'CCC' 'FLD UNKNO' 'CCC' 'B' 'C' 'CCC' 'BBB']
CodePudding user response:
If lst is your list from the question, then this example changes the B, BB and CCC values:
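import re

# Match a standalone B, BB or CCC; \b keeps BBB and CC from matching.
pat = re.compile(r"\b(?:B{1,2}|CCC)\b")

out = []
for item in lst:
    # Leave non-string values (e.g. None) untouched.
    if not isinstance(item, str):
        out.append(item)
        continue
    m = pat.search(item)
    if m:
        out.append(m.group(0))
    else:
        out.append(item)

# Print the original value next to the cleaned value.
for a, b in zip(out, lst):
    print(b, a, sep="\t")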
Prints:
BB BB
B- B
BB BB
BB- BB
CCC CCC
N.A. N.A.
CCC CCC
B B
B B
None None
BB *- BB
B- * B
CC *- CC *-
B *- B
BB- *- BB
NR NR
CCC- CCC
BBB- BBB-
B * B
BB * BB
B * B
BB * BB
D D
BB *- BB
CCC * CCC
BBB BBB
C *- C *-
CCC *- CCC
BB- * BB
CCC * CCC
FLD UNKNO FLD UNKNO
(P)CCC CCC
(P)B B
C C
CCC- *- CCC
BBB- *- BBB- *-
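As the last line shows, the BBB variants are left as they are. If they should be collapsed as well, here is a minimal sketch that wraps the same idea in a function. Extending the pattern to `B{1,3}` is an assumption, since the question does not say how BBB should be handled, and the names `RATING_PAT` and `normalize_rating` are made up for this example.

```python
import re

# Same pattern as above, widened to B{1,3} so BBB variants match too
# (an assumption: the question does not say how BBB should be handled).
RATING_PAT = re.compile(r"\b(?:B{1,3}|CCC)\b")

def normalize_rating(value):
    """Return the bare B/BB/BBB/CCC category, or the value unchanged."""
    if not isinstance(value, str):
        return value  # None, NaN, etc. pass through untouched
    m = RATING_PAT.search(value)
    return m.group(0) if m else value

sample = ['BB ', 'B- * ', '(P)CCC ', 'BBB- *-', 'CC *-', 'N.A.', None]
cleaned = [normalize_rating(v) for v in sample]
print(cleaned)
```

On a dataframe column this could probably be applied with something like `df['rating'] = df['rating'].map(normalize_rating)`, where `df` and the `rating` column name are placeholders for your own data.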