Filter rows of strings after a starting row that starts with certain characters

Time:08-10

I'm using Python. I have extracted text from a PDF, so I have a data frame full of strings with just one column and no column name. I need to keep the rows from a starting row until the end. The starting row is identified by the fact that it starts with certain characters. Consider the following example:

 ---------------- 
|   aaaaaaa      |
|   bbbbbb       |
|   ccccccc      |
|   hellodddd    |
|   eeeeeeeee    |
|   fffffffffff  |
|   gggggggg     |
|   hhhhhhhh     |
 ---------------- 

I need to keep the rows from the starting row, which is hellodddd, until the end. As you can see, the starting row is identified by the fact that it starts with the characters hello. So, the expected output is:

 ---------------- 
|   hellodddd    |
|   eeeeeeeee    |
|   fffffffffff  |
|   gggggggg     |
|   hhhhhhhh     |
 ---------------- 

I think this example can be reproduced with the following code:

import pandas as pd

mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'gggggggg', 'hhhhhhhh']
df = pd.DataFrame(mylist)

I think I need to use the startswith() function first to identify the starting row. But then, what could I do to select the wanted rows (the ones from the starting row until the end)?

CodePudding user response:

.startswith() is a string method that returns whether a string starts with a given prefix. On its own it won't select rows in a dataframe (unless you're looking for the first row with a value that starts with that string).

You're looking for something like:

import pandas as pd

mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'hellodddd', 'hhhhhhhh']
df = pd.DataFrame(mylist)

print(df[(df[0].values == 'hellodddd').argmax():])

Result:

             0
3    hellodddd
4    eeeeeeeee
5  fffffffffff
6    hellodddd
7     hhhhhhhh

Note that I replaced a later value with 'hellodddd' as well, to show that it will include all rows from the first match onwards.

Edit: in response to the comment:

import pandas as pd

mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'hellodddd', 'hhhhhhhh']
df = pd.DataFrame(mylist)

print(df[(df[0].str.startswith('hello')).argmax():])

Result is identical.
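
One caveat with the argmax() approach: when no row matches, the boolean mask is all False, argmax() returns 0, and the slice silently returns the whole frame. Below is a minimal sketch of a guard against that. The helper name rows_from_first_match and its parameters are hypothetical, and it assumes the strings sit in column 0 as in the question.

```python
import pandas as pd


def rows_from_first_match(df, prefix, column=0):
    """Return rows from the first value starting with `prefix` to the end.

    Hypothetical helper (not part of pandas). Returns an empty frame
    when no value in `column` starts with `prefix`.
    """
    # na=False treats missing values as non-matching
    mask = df[column].str.startswith(prefix, na=False).to_numpy()
    if not mask.any():
        # No match: return an empty slice instead of the whole frame
        return df.iloc[0:0]
    # argmax() gives the position of the first True in the mask
    return df.iloc[mask.argmax():]


mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'gggggggg', 'hhhhhhhh']
df = pd.DataFrame(mylist)

print(rows_from_first_match(df, 'hello'))
print(rows_from_first_match(df, 'nothing'))
```

The second call prints an empty frame rather than all eight rows, which is usually what you want when the marker row is missing from the extracted text.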

CodePudding user response:

I don't know much about pandas, but I know that itertools can solve this problem:

import itertools

mylist = [
    'aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee',
    'fffffffffff', 'gggggggg', 'hhhhhhhh'
]

result = list(itertools.dropwhile(
    lambda element: not element.startswith("hello"),
    mylist,
))

The dropwhile function discards elements from the start of the list for as long as the condition is true. As soon as the condition is false for one element, it returns that element and everything after it.
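
To illustrate that behaviour, here is a small sketch showing two edge cases: elements after the first match are kept even if they would not match themselves, and the result is empty if nothing matches. The function name from_first_prefix and the sample list are made up for the example.

```python
import itertools


def from_first_prefix(items, prefix):
    """Hypothetical helper: keep items from the first one starting with `prefix`."""
    return list(itertools.dropwhile(
        lambda element: not element.startswith(prefix),
        items,
    ))


rows = ['aaaaaaa', 'hellodddd', 'eeeeeeeee', 'hellozzz', 'hhhhhhhh']

# The predicate is only consulted until the first match is found
print(from_first_prefix(rows, 'hello'))
# ['hellodddd', 'eeeeeeeee', 'hellozzz', 'hhhhhhhh']

# No element matches, so everything is dropped
print(from_first_prefix(rows, 'nope'))
# []
```

If the strings live in a data frame, passing df[0] as items should work the same way, since dropwhile accepts any iterable.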
