I'm using Python. I have extracted text from a PDF, so I have a data frame full of strings with just one column and no column name. I need to keep the rows from a starting row through to the end. The starting row is identified by the fact that it starts with certain characters. Consider the following example:
----------------
| aaaaaaa |
| bbbbbb |
| ccccccc |
| hellodddd |
| eeeeeeeee |
| fffffffffff |
| gggggggg |
| hhhhhhhh |
----------------
I need to keep the rows from the starting row, which is hellodddd, until the end. As you can see, the starting row is identified because it starts with the characters hello.
So, the expected output is:
----------------
| hellodddd |
| eeeeeeeee |
| fffffffffff |
| gggggggg |
| hhhhhhhh |
----------------
I think this example can be reproduced with the following code:
import pandas as pd
mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'gggggggg', 'hhhhhhhh']
df = pd.DataFrame(mylist)
I think I need to use the startswith() function first to identify the starting row. But then, what could I do to select the wanted rows (the ones from the starting row until the end)?
CodePudding user response:
.startswith() is a method on a string that returns whether or not the string starts with some substring. On its own it won't select rows in a dataframe, although it can help you find the first row with a value that starts with that string.
You're looking for something like:
import pandas as pd
mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'hellodddd', 'hhhhhhhh']
df = pd.DataFrame(mylist)
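# Compare every value to the target; argmax() gives the position of the first True.
# Caveat: if nothing matches, argmax() returns 0 and the whole frame is printed.
print(df[(df[0].values == 'hellodddd').argmax():])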
Result:
0
3 hellodddd
4 eeeeeeeee
5 fffffffffff
6 hellodddd
7 hhhhhhhh
Note that I replaced a later value with 'hellodddd' as well, to show that it will include all rows from the first match onwards.
Edit: in response to the comment, to match on a prefix instead of the full value:
import pandas as pd
mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'hellodddd', 'hhhhhhhh']
df = pd.DataFrame(mylist)
print(df[(df[0].str.startswith('hello')).argmax():])
Result is identical.
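One caveat with argmax() is that it returns 0 when nothing matches, so the whole frame comes back. A minimal sketch of an alternative, assuming a cumulative-maximum mask is acceptable here; the helper name rows_from_first_match and its column parameter are made up for illustration and are not part of pandas:

```python
import pandas as pd

def rows_from_first_match(df, prefix, column=0):
    """Return the rows from the first value starting with `prefix` to the end."""
    # na=False treats missing values as non-matches instead of propagating NaN.
    mask = df[column].str.startswith(prefix, na=False)
    # cummax() keeps the mask True from the first True onwards, so a frame
    # with no match yields an all-False mask and therefore an empty result.
    return df[mask.cummax()]

mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'gggggggg', 'hhhhhhhh']
df = pd.DataFrame(mylist)

result = rows_from_first_match(df, 'hello')
no_match = rows_from_first_match(df, 'zzz')
print(result)
```

With the question's data, result holds rows 3 to 7 and no_match is an empty frame rather than a copy of the input.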
CodePudding user response:
I don't know much about pandas, but I know that itertools can solve this problem:
import itertools
mylist = [
    'aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee',
    'fffffffffff', 'gggggggg', 'hhhhhhhh'
]
result = list(itertools.dropwhile(
    lambda element: not element.startswith("hello"),
    mylist,
))
The dropwhile function drops (discards) elements for as long as the condition is true. As soon as the condition is false for an element, it returns that element and everything after it.
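To make that behaviour concrete, here is a small sketch with made-up sample lines (they are not from the question). It shows that only the leading non-matching elements are dropped, and that an input with no match gives an empty list:

```python
from itertools import dropwhile

def starts_late(line):
    # True while we are still before the first line starting with "hello".
    return not line.startswith("hello")

lines = ['aaaaaaa', 'hellodddd', 'eeeeeeeee', 'hello again', 'hhhhhhhh']

# Only the leading 'aaaaaaa' is discarded; later lines are kept
# whether or not they start with "hello".
kept = list(dropwhile(starts_late, lines))

# With no matching line, every element is dropped.
none_found = list(dropwhile(starts_late, ['aaa', 'bbb']))

print(kept)
print(none_found)
```

If the result is needed as a data frame again, the list can be passed back to pd.DataFrame as in the question.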