Specify locations in pandas.DataFrame using a combination of integer and string

What I would like

## Sample DataFrame
import pandas as pd

data = [[0, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 1]]
index = ['Item1', 'Item2']
columns = ['20220130', '20220131', '20220201', '20220202', '20220203', '20220204']
df = pd.DataFrame(data, index=index, columns=columns)

print(df)
## Output
#       20220130  20220131  20220201  20220202  20220203  20220204
#Item1         0         0         0         0         1         0
#Item2         0         1         0         0         0         1

The column names are dates. I would like to change 0 to -1 wherever the value is 0 for three or more days in a row.

print(df)
## Expected Output
#       20220130  20220131  20220201  20220202  20220203  20220204
#Item1        -1        -1        -1        -1         1         0
#Item2         0         1        -1        -1        -1         1

What I did

I tried to read the values one by one and find where to update (0 to -1).
The problem is the expression (date - 2): date is a string, so subtracting an integer fails. Is there a way to specify column locations using a combination of integers and names?

for item, row in df.iterrows():
    count = 0
    for date, value in row.iteritems():
        if value == 0:
            count += 1
        else:
            count = 0
        if count >= 3:
            df.loc[item, (date - 2):date]
## Output
# TypeError: unsupported operand type(s) for -: 'str' and 'int'

Premises

I have other use cases unrelated to dates, so please do not suggest converting the dates to datetime objects and using timedelta. Here I would like to know how to specify columns by offset, like column_name - 2.

Environment

Python 3.10.5
Pandas 1.4.3

CodePudding user response：

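For that to run, you need to cast date to an integer, e.g. df.loc[item, str(int(date) - 2):date], but the elif: block is wrong, and df.loc used like that would allow you to replace only one element. Note also that this does arithmetic on the numeric value of the label rather than on the column position, so it does not step back two columns across a month boundary (int('20220201') - 2 is 20220199, which is not a column).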
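On the literal question of writing something like column_name - 2: one option is to translate the label into an integer position and then index positionally. The following is a minimal sketch, assuming the row and column labels are unique (so that `Index.get_loc` returns a single integer); the names `result`, `row_pos` and `col_pos` are made up for illustration. It reads from `df` and writes to a copy so that the counting is not affected by the replacements.

```python
import pandas as pd

data = [[0, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 1]]
index = ['Item1', 'Item2']
columns = ['20220130', '20220131', '20220201', '20220202', '20220203', '20220204']
df = pd.DataFrame(data, index=index, columns=columns)

result = df.copy()  # write here, keep df untouched while counting
for item in df.index:
    count = 0
    for date in df.columns:
        # length of the current run of zeros ending at this column
        count = count + 1 if df.at[item, date] == 0 else 0
        if count >= 3:
            # label -> integer position (assumes unique labels)
            row_pos = df.index.get_loc(item)
            col_pos = df.columns.get_loc(date)
            # positional slice covering this column and the two before it
            result.iloc[row_pos, col_pos - 2:col_pos + 1] = -1

print(result)
```

Because the offset is applied to the position and not to the label, this does not depend on the labels being dates or numbers at all.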

I had a similar problem some time ago and solved it with a windowed list. This way you don't have to locate the specific element to change and you don't rely on any convention for your column names. In short: iterate over the rows, make a windowed list of each row, and replace the window with your replacement (in this case, [-1,-1,-1]) if the window contains only zeros; otherwise leave it unaltered.

Here is the function to create a windowed list:

    from typing import Any, Iterable, List, Optional

    def windowed_list(
        original_list: Iterable[Any],
        window_size: int,
        minimum_window_size: Optional[int] = None,
        disjunct: bool = True,
    ) -> List[List[Any]]:
    
        """Creates a windowed copy of the original_list

        Args:
            original_list (Iterable): list to window
            window_size (int): size of the windows
            minimum_window_size (int, optional): minimum size of the windows. Only windows with at least this many elements are kept. Set to any number less than 2 to keep all windows. Defaults to window_size.
            disjunct (bool, optional): whether to produce disjunct (non-overlapping) windows.
                If False, produces overlapping windows, each shifted by one element from the previous one.
                Example for [0,1,2,3,4,5] with window_size=3: [[0,1,2], [1,2,3], [2,3,4], [3,4,5]] if False, [[0,1,2], [3,4,5]] if True.
                Defaults to True.

        Returns:
            List[List]: the list of windows (slices of original_list).
        """
        if minimum_window_size is None:
            minimum_window_size = window_size

        if minimum_window_size > window_size:
            raise ValueError(
                f"minimum_window_size={minimum_window_size} > window_size={window_size}"
            )

        windowed_list = (
            [
                original_list[i : i + window_size]
                for i in range(len(original_list) - len(original_list) % window_size)
            ]
            if window_size <= len(original_list)
            else original_list
        )
        if minimum_window_size:
            windowed_list = [
                window for window in windowed_list if len(window) >= minimum_window_size
            ]

        return windowed_list[::window_size] if disjunct else windowed_list

And here it is applied to your problem:

    import pandas as pd

    data = [[0, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 1]]
    index = ['Item1', 'Item2']
    columns = ['20220130', '20220131', '20220201', '20220202', '20220203', '20220204']
    df = pd.DataFrame(data, index=index, columns=columns)


    window_size = 3
    for row_index, row in df.iterrows():
        # Work on a copy: pandas is peculiar about overwriting, so modify
        # the copy and overwrite the whole row in df at the end
        copy_row = row.copy()
        windowed_row = windowed_list(row, window_size, disjunct=False)
        for idx, window in enumerate(windowed_row):
            if sum([val == 0 for val in window]) == window_size:
                copy_row[idx:idx + window_size] = [-1]*window_size
        df.loc[row_index] = copy_row #overwrite the entire row
    print(df)

Output:

       20220130  20220131  20220201  20220202  20220203  20220204
Item1        -1        -1        -1        -1         1         0
Item2         0         1        -1        -1        -1         1

I tested this approach with a couple of cases and it seems to work, but please tell me if you find a case where it doesn't.

CodePudding user response：

You can stack, then use a custom groupby to group the successive identical values per ID and get the size of each group. After unstacking, use this (restricted to the zeros) as a mask:

N = 3
s = df.rename_axis('index').stack()
# label each run of identical consecutive values, then flag runs of length >= N
m = s.groupby(['index', s.ne(s.shift()).cumsum()]).transform('size').ge(N).unstack()

# replace only the zeros that belong to a flagged run
out = df.mask(m & df.eq(0), -1)

output:

       20220130  20220131  20220201  20220202  20220203  20220204
Item1        -1        -1        -1        -1         1         0
Item2         0         1        -1        -1        -1         1
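To see what the "group the successive identical values" step does, here is a small sketch on a single row (the values of Item2), without the stack/unstack. The names `run_id`, `run_len` and `masked` are only illustrative. Comparing each value with the previous one marks where a new run starts, and the cumulative sum of those marks gives every run its own id.

```python
import pandas as pd

row = pd.Series([0, 1, 0, 0, 0, 1])

# True wherever the value differs from the previous one, i.e. a new run starts
starts = row.ne(row.shift())
# cumulative sum turns the start markers into one id per run
run_id = starts.cumsum()
# length of the run each element belongs to
run_len = row.groupby(run_id).transform('size')

print(run_id.tolist())   # [1, 2, 3, 3, 3, 4]
print(run_len.tolist())  # [1, 1, 3, 3, 3, 1]

# replace zeros that sit in a run of at least 3
masked = row.mask(row.eq(0) & run_len.ge(3), -1)
print(masked.tolist())   # [0, 1, -1, -1, -1, 1]
```

In the DataFrame version, adding 'index' as a second groupby key keeps a run at the end of one row from merging with a run at the start of the next.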

CodePudding user response：

Another possible solution, based on the following ideas:

Convert each row to a string.
Use a regex to replace every run of three or more zeros with a run of nines of the same length.
Map the string back to a list of integers, turning each 9 into -1.
Assign the result to the dataframe.

import re

def repl(m):
    return '9' * len(m.group())

df.iloc[:] = pd.DataFrame.from_records(
  df.apply(lambda x:
    [*map(
      lambda y: -1 if y == '9' else int(y),
      [*re.sub(r"0{3}0*", repl, ''.join(map(str, x)))])],
    axis = 1))

df

#>        20220130  20220131  20220201  20220202  20220203  20220204
#> Item1        -1        -1        -1        -1         1         0
#> Item2         0         1        -1        -1        -1         1
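For reference, the string round trip can be followed on one row in isolation. This is only a sketch of the idea: the names `as_text`, `marked` and `restored` are illustrative, and it assumes that every value is a single non-negative digit and that 9 never occurs in the data, since otherwise the characters no longer map back to the original cells.

```python
import re

def repl(m):
    # same length as the matched run, so positions are preserved
    return '9' * len(m.group())

row = [0, 1, 0, 0, 0, 1]

as_text = ''.join(map(str, row))          # '010001'
marked = re.sub(r"0{3,}", repl, as_text)  # '019991'
restored = [-1 if ch == '9' else int(ch) for ch in marked]

print(restored)  # [0, 1, -1, -1, -1, 1]
```

The pattern `0{3,}` matches the same runs as `0{3}0*` in the code above.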