I have an input.txt file.
Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)
Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)
Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)
Information: The file 'U_1A/U_2B/U_3C/U_4D' will be found (TBD-001)
Information: The file 'U_1A/U_2C/U_3C/U_4D' will be found (TBD-001)
Information: The file 'U_1A/U_2C/U_3C/U_4A' will be found (TBD-001)
Information: The file 'U_1B/U_2A/U_3C/U_4B' will be found (TBD-001)
Information: The file 'U_1B/U_2B/U_3C/U_4D' will be found (TBD-001)
Information: The file 'U_1C/U_2B/U_3C/U_4A' will be found (TBD-001)
Information: The file 'U_1C/U_2B/U_3C/U_4D' will be found (TBD-001)
What I really want to do is count how often each string inside the single quotes occurs and print the counts in descending order.
When I run the Python code, the user should enter a number, and that number is the depth of the hierarchy to count at. For example, when I enter 2, the output should look like this:
U_1A/U_2A : 3
U_1A/U_2C : 2
U_1C/U_2B : 2
U_1A/U_2B : 1
U_1B/U_2A : 1
U_1B/U_2B : 1
And when I enter 3, the output should look like this:
U_1A/U_2A/U_3C : 3
U_1A/U_2C/U_3C : 2
U_1C/U_2B/U_3C : 2
U_1A/U_2B/U_3C : 1
U_1B/U_2A/U_3C : 1
U_1B/U_2B/U_3C : 1
And so on.
To get there, I have managed to extract the quoted string from each line with the following code:
with open("input.txt", "r") as text:
    # Loop through each line of the file
    for line in text:
        # Remove the leading/trailing whitespace and the newline character
        line = line.strip()
        # Split the line into words
        words = line.split(' ')
        # The fourth word is the quoted path, e.g. 'U_1A/U_2A/U_3C/U_4D'
        word = words[3]
        print(word)
CodePudding user response:
You can do this quickly in pandas. The file is read with a space as the separator, which gives a dataframe whose column 3 holds the quoted path. Each path is split into a list on '/', the first i elements are taken in a loop, the list is joined back into a string with join, and the unique values are then counted with value_counts.
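import pandas as pd

# Split each line on spaces; column 3 holds the quoted path
df = pd.read_csv('df2.txt', sep = ' ', header=None)
for i in range(1, 5):
    # Keep the first i levels of each path, then count the unique prefixes
    a = df[3].str.split("/").str[:i].str.join('/')
    print(a.value_counts())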
Input
0 1 2 3 4 5 6 7
0 Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)
1 Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)
2 Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)
3 Information: The file 'U_1A/U_2B/U_3C/U_4D' will be found (TBD-001)
4 Information: The file 'U_1A/U_2C/U_3C/U_4D' will be found (TBD-001)
5 Information: The file 'U_1A/U_2C/U_3C/U_4A' will be found (TBD-001)
6 Information: The file 'U_1B/U_2A/U_3C/U_4B' will be found (TBD-001)
7 Information: The file 'U_1B/U_2B/U_3C/U_4D' will be found (TBD-001)
8 Information: The file 'U_1C/U_2B/U_3C/U_4A' will be found (TBD-001)
9 Information: The file 'U_1C/U_2B/U_3C/U_4D' will be found (TBD-001)
Output
'U_1A 6
'U_1C 2
'U_1B 2
Name: 3, dtype: int64
'U_1A/U_2A 3
'U_1A/U_2C 2
'U_1C/U_2B 2
'U_1B/U_2B 1
'U_1B/U_2A 1
'U_1A/U_2B 1
Name: 3, dtype: int64
'U_1A/U_2A/U_3C 3
'U_1C/U_2B/U_3C 2
'U_1A/U_2C/U_3C 2
'U_1B/U_2B/U_3C 1
'U_1B/U_2A/U_3C 1
'U_1A/U_2B/U_3C 1
Name: 3, dtype: int64
'U_1A/U_2A/U_3C/U_4D' 3
'U_1A/U_2C/U_3C/U_4D' 1
'U_1B/U_2A/U_3C/U_4B' 1
'U_1A/U_2C/U_3C/U_4A' 1
'U_1C/U_2B/U_3C/U_4A' 1
'U_1C/U_2B/U_3C/U_4D' 1
'U_1B/U_2B/U_3C/U_4D' 1
'U_1A/U_2B/U_3C/U_4D' 1
As far as I understand, you need to count bla1, bla2, bla3. In that case no loop is needed; just specify the desired slice, which in this case is [:4].
a = df[3].str.split("/").str[:4].str.join('/').value_counts()
print(a)
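For comparison, the same prefix count can be sketched without pandas. This is not the answer's method but a standard-library alternative using re and collections.Counter; the function name count_prefixes and the inline sample lines are made up for illustration. The regular expression keeps only the text between the single quotes, so the leading quote seen in the pandas output does not appear, and ties are broken alphabetically, which is an assumption about the desired ordering.

```python
import re
from collections import Counter

# Matches the text between the first pair of single quotes on a line.
QUOTED = re.compile(r"'([^']*)'")


def count_prefixes(lines, depth):
    """Count quoted paths truncated to their first `depth` levels (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = QUOTED.search(line)
        if match is None:
            continue  # skip lines that contain no quoted path
        prefix = "/".join(match.group(1).split("/")[:depth])
        counts[prefix] += 1
    # Highest count first; ties are broken alphabetically (an assumption).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Inline copy of the paths from the question, so the sketch runs without a file.
SAMPLE_PATHS = (
    ["U_1A/U_2A/U_3C/U_4D"] * 3
    + ["U_1A/U_2B/U_3C/U_4D", "U_1A/U_2C/U_3C/U_4D", "U_1A/U_2C/U_3C/U_4A"]
    + ["U_1B/U_2A/U_3C/U_4B", "U_1B/U_2B/U_3C/U_4D"]
    + ["U_1C/U_2B/U_3C/U_4A", "U_1C/U_2B/U_3C/U_4D"]
)
SAMPLE_LINES = [
    f"Information: The file '{path}' will be found (TBD-001)" for path in SAMPLE_PATHS
]

for prefix, count in count_prefixes(SAMPLE_LINES, 2):
    print(f"{prefix} : {count}")
```

To use it on the real file, the lines could come from open("input.txt") and the depth from int(input()) instead of the inline sample.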
If you need to skip an element, the solution is somewhat more complicated. For example, suppose you need bla1, bla2 and bla4. Split each value into separate columns with expand=True, use fancy indexing to select the columns you need, then in 'c' join the selected elements back together and count them.
b = df[3].str.split("/", expand=True)
c = b[[0, 1, 3]].apply(lambda row: '/'.join(row.values.astype(str)), axis=1)
print(c.value_counts())
Output
'U_1A/U_2A/U_4D' 3
'U_1A/U_2C/U_4A' 1
'U_1A/U_2B/U_4D' 1
'U_1C/U_2B/U_4D' 1
'U_1B/U_2A/U_4B' 1
'U_1A/U_2C/U_4D' 1
'U_1B/U_2B/U_4D' 1
'U_1C/U_2B/U_4A' 1
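Picking arbitrary levels can also be sketched with the standard library alone. This is an alternative to the pandas code above, not a replacement for it; count_selected_levels is a hypothetical helper, the sample lines are a shortened copy of the question's data, and the sketch assumes levels is a non-empty sequence of zero-based indices.

```python
import re
from collections import Counter


def count_selected_levels(lines, levels):
    """Count quoted paths after keeping only the given zero-based levels (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = re.search(r"'([^']*)'", line)
        if match is None:
            continue  # no quoted path on this line
        parts = match.group(1).split("/")
        if max(levels) >= len(parts):
            continue  # path is too shallow for the requested levels
        counts["/".join(parts[i] for i in levels)] += 1
    return counts


# Shortened sample in the same format as the question's input.
sample = [
    "Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)",
    "Information: The file 'U_1A/U_2A/U_3C/U_4D' will be found (TBD-001)",
    "Information: The file 'U_1A/U_2C/U_3C/U_4A' will be found (TBD-001)",
    "Information: The file 'U_1B/U_2A/U_3C/U_4B' will be found (TBD-001)",
]

# Keep levels 0, 1 and 3, skipping level 2, as in the example above.
result = count_selected_levels(sample, (0, 1, 3))
for key, count in result.most_common():
    print(f"{key} : {count}")
```

Passing the levels as a tuple keeps the helper general: (0, 1) reproduces a plain depth-2 prefix count, while (0, 1, 3) skips the third level.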