Count common prefixes by input value in Python

I have an input.txt file with the following contents:

Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)
Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)
Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)
Information: The file 'U_1A/U_2B/U_3C/U_4D' will be found (TBD-001)
Information: The file 'U_1A/U_2C/U_3C/U_4D' will be found (TBD-001)
Information: The file 'U_1A/U_2C/U_3C/U_4A' will be found (TBD-001)
Information: The file 'U_1B/U_2A/U_3C/U_4B' will be found (TBD-001)
Information: The file 'U_1B/U_2B/U_3C/U_4D' will be found (TBD-001)
Information: The file 'U_1C/U_2B/U_3C/U_4A' will be found (TBD-001)
Information: The file 'U_1C/U_2B/U_3C/U_4D' will be found (TBD-001)

What I want to do is count how often each path inside the single quotes occurs and print the counts in descending order.

When I run the Python code, the user should enter a number, and that number is the depth of the hierarchy to count. For example, when I enter 2, the output should look like this:

U_1A/U_2A : 3
U_1A/U_2C : 2
U_1C/U_2B : 2
U_1A/U_2B : 1
U_1B/U_2A : 1
U_1B/U_2B : 1

And when I enter 3, the output should look like this:

U_1A/U_2A/U_3C : 3
U_1A/U_2C/U_3C : 2
U_1C/U_2B/U_3C : 2
U_1A/U_2B/U_3C : 1
U_1B/U_2A/U_3C : 1
U_1B/U_2B/U_3C : 1

And so on.

As a first step, I managed to extract the string inside the quotes with the following code:

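# Open the file; the with-block closes it once the loop is done
with open("input.txt", "r") as text:
    # Loop through each line of the file
    for line in text:
        # Remove leading/trailing whitespace, including the newline
        line = line.strip()
        # Split the line into space-separated words
        words = line.split(' ')
        # The fourth word is the quoted path, e.g. 'U_1A/U_2A/U_3C/U_4D'
        word = words[3]
        print(word)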

CodePudding user response：

You can do this quickly in pandas. The file is read with a space as the separator, which gives a dataframe whose column 3 holds the quoted path. Each path is split on '/' into a list, the first i elements are kept (i runs over the depths in the loop), the list is joined back into a string with join, and value_counts counts how often each resulting prefix occurs.

import pandas as pd

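# Space-separated columns, no header row; column 3 holds the quoted path
df = pd.read_csv('df2.txt', sep=' ', header=None)

# Depths 1 to 4: keep the first i path components and count each prefix
for i in range(1, 5):
    a = df[3].str.split("/").str[:i].str.join('/')
    print(a.value_counts())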

Input

              0    1     2                      3     4   5      6          7
0  Information:  The  file  'U_1A/U_2A/U_3C/U_4D'  will  be  found  (TBD-001)
1  Information:  The  file  'U_1A/U_2A/U_3C/U_4D'  will  be  found  (TBD-001)
2  Information:  The  file  'U_1A/U_2A/U_3C/U_4D'  will  be  found  (TBD-001)
3  Information:  The  file  'U_1A/U_2B/U_3C/U_4D'  will  be  found  (TBD-001)
4  Information:  The  file  'U_1A/U_2C/U_3C/U_4D'  will  be  found  (TBD-001)
5  Information:  The  file  'U_1A/U_2C/U_3C/U_4A'  will  be  found  (TBD-001)
6  Information:  The  file  'U_1B/U_2A/U_3C/U_4B'  will  be  found  (TBD-001)
7  Information:  The  file  'U_1B/U_2B/U_3C/U_4D'  will  be  found  (TBD-001)
8  Information:  The  file  'U_1C/U_2B/U_3C/U_4A'  will  be  found  (TBD-001)
9  Information:  The  file  'U_1C/U_2B/U_3C/U_4D'  will  be  found  (TBD-001)

Output

'U_1A    6
'U_1C    2
'U_1B    2
Name: 3, dtype: int64
'U_1A/U_2A    3
'U_1A/U_2C    2
'U_1C/U_2B    2
'U_1B/U_2B    1
'U_1B/U_2A    1
'U_1A/U_2B    1
Name: 3, dtype: int64
'U_1A/U_2A/U_3C    3
'U_1C/U_2B/U_3C    2
'U_1A/U_2C/U_3C    2
'U_1B/U_2B/U_3C    1
'U_1B/U_2A/U_3C    1
'U_1A/U_2B/U_3C    1
Name: 3, dtype: int64
'U_1A/U_2A/U_3C/U_4D'    3
'U_1A/U_2C/U_3C/U_4D'    1
'U_1B/U_2A/U_3C/U_4B'    1
'U_1A/U_2C/U_3C/U_4A'    1
'U_1C/U_2B/U_3C/U_4A'    1
'U_1C/U_2B/U_3C/U_4D'    1
'U_1B/U_2B/U_3C/U_4D'    1
'U_1A/U_2B/U_3C/U_4D'    1
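
The labels in the output above start with a single quote because column 3 still contains the quotes from the file. One possible way around that, sketched here on a small hand-built Series rather than the real file, is to strip the quotes with str.strip before splitting. The Series contents and the names paths, depth and prefixes are illustrative assumptions, not part of the original code.

```python
import pandas as pd

# Stand-in for df[3]: quoted paths as they appear in the file (assumed sample)
paths = pd.Series([
    "'U_1A/U_2A/U_3C/U_4D'",
    "'U_1A/U_2A/U_3C/U_4D'",
    "'U_1A/U_2B/U_3C/U_4D'",
    "'U_1B/U_2A/U_3C/U_4B'",
])

depth = 2  # number of leading path components to keep

# Strip the surrounding quotes, then truncate each path to `depth` parts
prefixes = paths.str.strip("'").str.split("/").str[:depth].str.join("/")
counts = prefixes.value_counts()
print(counts)
```

With the real data, the same chain should apply to df[3] in place of paths.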

As far as I understand, you need to count bla1, bla2, bla3. In that case no loop is needed: specify the desired slice, where [:n] keeps the first n components. Here it is [:4].

a = df[3].str.split("/").str[:4].str.join('/').value_counts()
print(a)
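
The question asks for a single depth chosen by the user and for lines in the form "U_1A/U_2A : 3". A minimal sketch of that using only the standard library follows. It assumes the path is the first single-quoted substring on each line. The helper name count_prefixes is hypothetical, and breaking ties alphabetically is an assumption chosen to match the ordering in the question's example.

```python
import re
from collections import Counter

# Matches the first '...' substring on a line and captures its contents
QUOTED = re.compile(r"'([^']*)'")


def count_prefixes(lines, depth):
    """Count paths truncated to `depth` components.

    Returns (prefix, count) pairs sorted by count descending,
    then by prefix ascending.
    """
    counts = Counter()
    for line in lines:
        match = QUOTED.search(line)
        if match is None:
            continue  # skip lines without a quoted path
        parts = match.group(1).split("/")
        counts["/".join(parts[:depth])] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# The ten lines from the question's input.txt
sample = [
    "Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)",
    "Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)",
    "Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)",
    "Information: The file 'U_1A/U_2B/U_3C/U_4D' will be found (TBD-001)",
    "Information: The file 'U_1A/U_2C/U_3C/U_4D' will be found (TBD-001)",
    "Information: The file 'U_1A/U_2C/U_3C/U_4A' will be found (TBD-001)",
    "Information: The file 'U_1B/U_2A/U_3C/U_4B' will be found (TBD-001)",
    "Information: The file 'U_1B/U_2B/U_3C/U_4D' will be found (TBD-001)",
    "Information: The file 'U_1C/U_2B/U_3C/U_4A' will be found (TBD-001)",
    "Information: The file 'U_1C/U_2B/U_3C/U_4D' will be found (TBD-001)",
]

for prefix, count in count_prefixes(sample, 2):
    print(f"{prefix} : {count}")
```

To run it on the real file, pass the open file object instead of sample and read the depth with int(input()).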

If you need to skip an element, the solution is somewhat more complicated. Say you need bla1, bla2 and bla4. Split each value into separate columns with expand=True, select the needed columns by passing a list of column labels, then in 'c' join the selected elements back together and count them.

b = df[3].str.split("/", expand=True)
c = b[[0, 1, 3]].apply(lambda row: '/'.join(row.values.astype(str)), axis=1)
print(c.value_counts())

Output

'U_1A/U_2A/U_4D'    3
'U_1A/U_2C/U_4A'    1
'U_1A/U_2B/U_4D'    1
'U_1C/U_2B/U_4D'    1
'U_1B/U_2A/U_4B'    1
'U_1A/U_2C/U_4D'    1
'U_1B/U_2B/U_4D'    1
'U_1C/U_2B/U_4A'    1
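
The same selection of non-adjacent components can also be sketched without a dataframe. This is only an illustration under assumptions: the paths are taken as already extracted and unquoted, and pick_components is a hypothetical helper name.

```python
from collections import Counter


def pick_components(path, indices):
    """Join only the components of `path` at the given positions."""
    parts = path.split("/")
    return "/".join(parts[i] for i in indices)


# Paths from the sample input, with the surrounding quotes already removed
paths = [
    "U_1A/U_2A/U_3C/U_4D",
    "U_1A/U_2A/U_3C/U_4D",
    "U_1A/U_2A/U_3C/U_4D",
    "U_1A/U_2B/U_3C/U_4D",
    "U_1A/U_2C/U_3C/U_4D",
    "U_1A/U_2C/U_3C/U_4A",
    "U_1B/U_2A/U_3C/U_4B",
    "U_1B/U_2B/U_3C/U_4D",
    "U_1C/U_2B/U_3C/U_4A",
    "U_1C/U_2B/U_3C/U_4D",
]

# Keep components 0, 1 and 3 (skip the third) and count the results
counts = Counter(pick_components(p, (0, 1, 3)) for p in paths)

# most_common() lists the entries from highest to lowest count
for key, count in counts.most_common():
    print(key, count)
```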