Python - count successive leading digits on a pandas row string without counting non successive digits

I need to create a new column that counts the number of leading 0s, but I am getting errors trying to do so. I extracted the data from MongoDB using the regex ^0[0]*[1-9][0-9]* and saved it to a csv file, so every value in "Sequence" starts with a 0.

df['Sequence'].str.count('0')

and

df['Sequence'].str.count('0[0]*[1-9][0-9]')

These give the results below. As you can see, both count calls also count non-leading 0s, or simply return the total number of 0s.

    Sequence    0s
0   012312312   1
1   024624624   1
2   036901357   2
3   002486248   2
4   045074305   3
5   080666140   3
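
The reason is that str.count counts every non-overlapping match anywhere in the string, not only matches at the start. A small sketch reproducing the totals in the table, using three of the values above as sample data:

```python
import pandas as pd

s = pd.Series(["012312312", "036901357", "045074305"])

# Every '0' in the string is a match, wherever it occurs,
# so this is the total number of zeros, not the leading run.
total_zeros = s.str.count("0")
print(total_zeros.tolist())
```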

I also tried writing a loop, which worked when testing, but when I use it on the data frame I get the following error: **IndexError: string index out of range**

results = []
for item in df['Sequence']:
    count = 0
    index = 0
    while item[index] == "0":
        count = count + 1
        index = index + 1
    results.append(count)
df['0s'] = results
df
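
The IndexError is consistent with the while loop running past the end of a value that is empty or consists only of zeros, since nothing stops the index from growing. Below is a minimal sketch of a bounds-checked version, assuming every value in the column is a string. The helper name count_leading_zeros and the sample data are made up for illustration.

```python
import pandas as pd

def count_leading_zeros(item: str) -> int:
    """Count consecutive '0' characters at the start of item."""
    count = 0
    # Check the length first so an all-zero or empty value
    # cannot index past the end of the string.
    while count < len(item) and item[count] == "0":
        count += 1
    return count

# Hypothetical sample data, including an all-zero value and an empty string.
df = pd.DataFrame({"Sequence": ["012312312", "002486248", "000", ""]})
df["0s"] = df["Sequence"].apply(count_leading_zeros)
print(df)
```

If the real column contains missing values or non-string types, they would need to be handled before calling the helper.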

In short: for a value such as 001230 I want to get 2 instead of 3, so that I can save the results in a column and run my stats on it.

CodePudding user response：

You can use extract with the ^(0*) regex to match only the leading zeros. Then use str.len to get the length.

df['0s'] = df['sequence'].str.extract('^(0*)', expand = False).str.len()

Example input:

df = pd.DataFrame({'sequence': ['12040', '01230', '00010', '00120']})

Output:

  sequence  0s
0    12040   0
1    01230   1
2    00010   3
3    00120   2

CodePudding user response：

You can use this regex:

'^0+'

The ^ anchors the match to the beginning of the string. The + means the preceding character must occur at least once and may repeat.
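
As a rough sketch of how this pattern could be applied to the column: since '^0+' needs at least one zero, a value without a leading zero produces no match, which str.extract reports as NaN. The fillna(0) step and the sample data are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Sequence": ["012312312", "002486248", "123000"]})

# Capture the leading run of zeros; rows with no leading zero yield NaN.
leading = df["Sequence"].str.extract(r"^(0+)", expand=False)

# Length of the captured run, with missing matches counted as 0.
df["0s"] = leading.str.len().fillna(0).astype(int)
print(df)
```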

CodePudding user response：

IIUC, you want to count the number of leading 0s, right? Take advantage of the fact that leading 0s disappear when a string of digits is converted to an int. Here's one solution:

df['leading 0s'] = df['Sequence'].str.len() - df['Sequence'].astype(int).astype(str).str.len()

Output:

    Sequence  leading 0s
0  012312312           1
1  024624624           1
2  036901357           1
3  002486248           2
4  045074305           1
5  080666140           1
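
One caveat with the length-difference approach: a value made up only of zeros, such as '000', converts to the integer 0, which still prints as one character, so the result is one short. It also assumes every value parses as an integer. The sketch below compares it with a regex-based count on made-up sample data.

```python
import pandas as pd

df = pd.DataFrame({"Sequence": ["012312312", "002486248", "000"]})

# Length difference: '000' -> 0 -> '0', so 3 - 1 = 2 rather than 3.
by_length = df["Sequence"].str.len() - df["Sequence"].astype(int).astype(str).str.len()

# Regex on the leading run of zeros counts all three.
by_regex = df["Sequence"].str.extract(r"^(0*)", expand=False).str.len()

print(by_length.tolist())
print(by_regex.tolist())
```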

CodePudding user response：

Try str.findall:

df['0s'] = df['Sequence'].str.findall('^0*').str[0].str.len()
print(df)

# Output:
    Sequence  0s
0  012312312   1
1  024624624   1
2  036901357   1
3  002486248   2
4  045074305   1
5  080666140   1