I need to create a new column that counts the number of leading 0s, however I am getting errors trying to do so.
I extracted data from mongo based on the following regex [\^0[0]*[1-9][0-9]*\]
on mongo and saved it to a csv file. This is all "Sequences" that start with a 0.
df['Sequence'].str.count('0')
and
df['Sequence'].str.count('0[0]*[1-9][0-9]')
Give the below results. As you can see that both of the "count" string return will also count non leading 0s. Or simply the total number of 0s.
Sequence 0s
0 012312312 1
1 024624624 1
2 036901357 2
3 002486248 2
4 045074305 3
5 080666140 3
I also tried writing using loops which worked when testing but when using it on the data frame, I encounter the following **IndexError: string index out of range**
results = []
count = 0
index = 0
for item in df['Sequence']:
count = 0
index = 0
while (item[index] == "0"):
count = count 1
index = index 1
results.append(count)
df['0s'] = results
df
In short; If I can get 2 for 001230 substring instead of 3. I could save the results in a column to do my stats on.
CodePudding user response:
You can use extract
with the ^(0*)
regex to match only the leading zeros. Then use str.len
to get the length.
df['0s'] = df['sequence'].str.extract('^(0*)', expand = False).str.len()
Example input:
df = pd.DataFrame({'sequence': ['12040', '01230', '00010', '00120']})
Output:
sequence 0s
0 12040 0
1 01230 1
2 00010 3
3 00120 2
CodePudding user response:
You can use this regex:
'^0 '
the ^
means, capture if the pattern starts at the beginning of the string.
the
means, capture if occuring at least once or multiple times.
CodePudding user response:
IIUC, you want to count the number of leading 0s, right? Take advantage of the fact that leading 0s disappear when an integer of type str
is converted to that of type int
. Here's one solution:
df['leading 0s'] = df['Sequence'].str.len() - df['Sequence'].astype(int).astype(str).str.len()
Output:
Sequence leading 0s
0 012312312 1
1 024624624 1
2 036901357 1
3 002486248 2
4 045074305 1
5 080666140 1
CodePudding user response:
Try str.findall
:
df['0s'] = df['Sequence'].str.findall('^0*').str[0].str.len()
print(df)
# Output:
Sequence 0s
0 012312312 1
1 024624624 1
2 036901357 1
3 002486248 2
4 045074305 1
5 080666140 1