Home > Blockchain >  Loop through a dataframe checking for a string match
Loop through a dataframe checking for a string match

Time:02-16

first time poster and python beginner so hopefully asking correctly.

I need to loop through all rows in a dataframe, checking for a string match in one column. If there is a match then I want to insert a date into a new column, if not then use a different date.

I need to iterate through the rows as each time the condition is met I want to advance the date by one day.

For now I'm just trying to make sure the string match is working so I'm just inserting two different strings depending on the match result. The problem is that my condition seem to never be met. I can't see where its going wrong.

My code so far is

for index in df.index:
    if df.loc[index, 'time'] == '00:00':
        df['date'] = "wooo"
    else:
        df['date'] = "blah"

The result I'm seeing is in the image below. As you can see it is not picking up the row that matches the condition.

enter image description here

CodePudding user response:

df['date'] = "wooo" sets the entire column to "wooo", which I'm sure is not your intention. You should just set that row's date column

for index in df.index:
    if df.loc[index, 'time'] == '00:00':
        df.loc[index, 'date'] = "wooo"
    else:
        df.loc[index, 'date'] = "blah"

Better yet, use logical indexing and set multiple rows at once instead of iterating over rows and setting each value individually:

df['date'] = 'blah'                 # Set everything to blah
zero_rows = df['time'] == '00:00'   # Find the rows where time is 00:00
df.loc[zero_rows, 'date'] = 'wooo'  # Set these rows to wooo

For example, with

df = pd.DataFrame([['00:00', 2, 3], ['00:01', 5, 6], ['00:02', 8, 9], ['00:00', 10, 11]], columns=['time', 'b', 'c'])

we get:

    time   b   c  date
0  00:00   2   3  wooo
1  00:01   5   6  blah
2  00:02   8   9  blah
3  00:00  10  11  wooo

Re. your comment:

The issue with the logical indexing is that eventually I’m trying to assign a date which counts forwards at the start of each new day (time = 00:00), so until a 00:00 is found insert todays date in each row. Once found, insert today 1 and continue until found again and then insert today 2 etc. The number of rows until the first 00:00 is found varies depending on when the dataframe is created.

This is easily implemented. Let's start with a longer dataframe:

df = pd.DataFrame([['00:00', 2, 3], ['12:00', 5, 6], ['00:00', 8, 9], ['12:00', 10, 11], ['00:00', 18, 19], ['12:00', 110, 111], ['00:00', 28, 29], ['12:00', 210, 211]], columns=['time', 'b', 'c'])

Let's pull out all the columns with time == '00:00'.

zero_rows = df['time'] == '00:00'
midnights = df.loc[zero_rows, :]

    time   b   c
0  00:00   2   3
2  00:00   8   9
4  00:00  18  19
6  00:00  28  29

Now, create a new column 'increment' and fill it with nan. Then, fill only the midnight values with numbers:

if df.iloc[0]['time'] == '00:00':
    start_increment = 0 # First row is midnight, so we start with an increment of zero
else:
    start_increment = 1 # First row is not midnight, so the first midnight row has an increment of 1

df['increment'] = pd.NA
df.loc[midnights.index, 'increment'] = range(start_increment, len(midnights) start_increment)

    time    b    c increment
0  00:00    2    3         0
1  12:00    5    6      <NA>
2  00:00    8    9         1
3  12:00   10   11      <NA>
4  00:00   18   19         2
5  12:00  110  111      <NA>
6  00:00   28   29         3
7  12:00  210  211      <NA>

Next, we can use ffill() to fill values in the increment column that are NA.

df['increment'].ffill(inplace=True)

    time    b    c  increment
0  00:00    2    3          0
1  12:00    5    6          0
2  00:00    8    9          1
3  12:00   10   11          1
4  00:00   18   19          2
5  12:00  110  111          2
6  00:00   28   29          3
7  12:00  210  211          3

Any NAs that still remain were before the first midnight, so they should be zeros:

df['increment'].fillna(0, inplace=True)

Now, we want to add a column 'date' which has a value of today increment days. To do this, we first convert the increment column to pd.TimeDelta:

df['increment'] = pd.to_timedelta(df['increment'], unit='D')

    time    b    c increment
0  00:00    2    3    0 days
1  12:00    5    6    0 days
2  00:00    8    9    1 days
3  12:00   10   11    1 days
4  00:00   18   19    2 days
5  12:00  110  111    2 days
6  00:00   28   29    3 days
7  12:00  210  211    3 days

Finally, add today's date to the increment column:

import datetime
df['date'] = datetime.datetime.today()   df['increment']

    time    b    c increment       date
0  00:00    2    3    0 days 2022-02-15
1  12:00    5    6    0 days 2022-02-15
2  00:00    8    9    1 days 2022-02-16
3  12:00   10   11    1 days 2022-02-16
4  00:00   18   19    2 days 2022-02-17
5  12:00  110  111    2 days 2022-02-17
6  00:00   28   29    3 days 2022-02-18
7  12:00  210  211    3 days 2022-02-18

CodePudding user response:

You can use apply with a Pythonic ternary-expression inside a lambda.

Try in the Python shell:

>>> import pandas as pd
>>> df = pd.DataFrame([['00:00', 2, 3], ['00:01', 5, 6], ['00:02', 8, 9], ['00:00', 10, 11]], columns=['time', 'b', 'c'])
>>> df['date'] = df.time.apply(lambda t: 'wooo' if t == '00:00' else 'blah')
>>> df
    time   b   c  date
0  00:00   2   3  wooo
1  00:01   5   6  blah
2  00:02   8   9  blah
3  00:00  10  11  wooo
  • Related