What I would like
## Sample DataFrame
import pandas as pd

data = [[0, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 1]]
index = ['Item1', 'Item2']
columns = ['20220130', '20220131', '20220201', '20220202', '20220203', '20220204']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
## Output
#       20220130  20220131  20220201  20220202  20220203  20220204
#Item1         0         0         0         0         1         0
#Item2         0         1         0         0         0         1
The column names are dates. I would like to change 0 to -1 wherever a row has 0 on three or more consecutive days.
print(df)
## Expected Output
#       20220130  20220131  20220201  20220202  20220203  20220204
#Item1        -1        -1        -1        -1         1         0
#Item2         0         1        -1        -1        -1         1
What I did
I tried to read the values one by one and find where to update (0 to -1). The problem is (date - 2). Is there a way to specify column locations using both integers and names?
for item, row in df.iterrows():
    count = 0
    for date, value in row.iteritems():
        if value == 0:
            count += 1
        else:
            count = 0
        if count >= 3:
            df.loc[item, (date - 2):date]
## Output
# TypeError: unsupported operand type(s) for -: 'str' and 'int'
Premises
I have other use cases unrelated to dates, so please do not convert the dates to datetime objects and use timedelta. Here I would like to know how I can specify columns like column_name - 2.
Environment
Python 3.10.5
Pandas 1.4.3
CodePudding user response:
For that to run, you need to cast date to an integer, say df.loc[item, str(int(date) - 2):date], but the elif: block is wrong, and df.loc used like that would allow you to replace only one element.
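On the question of mixing names and integer positions, here is a minimal sketch, assuming the row and column labels are unique: translate each label into a position with get_loc, then slice positionally with iloc. This also avoids arithmetic on the names themselves, which breaks across month boundaries (20220201 - 2 is 20220199, which is not a column). The helper name mark_zero_runs and its parameter n are made up for this illustration.

```python
import pandas as pd

data = [[0, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 1]]
index = ['Item1', 'Item2']
columns = ['20220130', '20220131', '20220201', '20220202', '20220203', '20220204']
df = pd.DataFrame(data, index=index, columns=columns)


def mark_zero_runs(frame: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Hypothetical helper: return a copy with runs of n or more zeros set to -1."""
    out = frame.copy()
    # Read from the untouched frame, write into the copy.
    for item, row in frame.iterrows():
        count = 0
        for date, value in row.items():
            count = count + 1 if value == 0 else 0
            if count >= n:
                # Turn the labels into integer positions (assumes unique labels),
                # then step back n - 1 columns from the current one.
                r = out.index.get_loc(item)
                c = out.columns.get_loc(date)
                out.iloc[r, c - (n - 1):c + 1] = -1
    return out


result = mark_zero_runs(df)
print(result)
```

Because the slice is positional, the same pattern should carry over to column names that are not dates.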
I had a similar problem some time ago and solved it with a windowed list. This way you don't have to locate the specific element to change and you don't rely on any convention for your column names. In short, iterate over the rows, make a windowed list of each row and replace the window with your replacement (in this case, [-1,-1,-1]) if the window contains all zeros; otherwise leave it unaltered.
Here is the function to create a windowed list:
from typing import Any, Iterable, List, Optional


def windowed_list(
    original_list: Iterable[Any],
    window_size: int,
    minimum_window_size: Optional[int] = None,
    disjunct: bool = True,
) -> List[Any]:
    """Creates a windowed copy of the original_list

    Args:
        original_list (Iterable): list to window; it must support len() and slicing
        window_size (int): size of the windows
        minimum_window_size (int): minimum size of the windows. All windows will have
            size greater than or equal to this parameter. Defaults to window_size.
            Set to any number less than 2 to keep all windows.
        disjunct (bool, optional): whether to produce disjunct windows or not.
            If False, each window is shifted by one element from the previous one.
            Example: [[0,1,2], [1,2,3], [2,3,4], [3,4,5]] if False, [[0,1,2], [3,4,5]] if True.
            Defaults to True.

    Returns:
        List: the windows, each one a slice of original_list
    """
    if minimum_window_size is None:
        minimum_window_size = window_size
    if minimum_window_size > window_size:
        raise ValueError(
            f"minimum_window_size={minimum_window_size} > window_size={window_size}"
        )
    windows = (
        [
            original_list[i : i + window_size]
            for i in range(len(original_list) - len(original_list) % window_size)
        ]
        if window_size <= len(original_list)
        # Input shorter than one window: treat the whole input as a single window
        else [original_list]
    )
    if minimum_window_size:
        # Drop the short windows produced near the end of the input
        windows = [
            window for window in windows if len(window) >= minimum_window_size
        ]
    return windows[::window_size] if disjunct else windows
And here it is applied to your problem:
import pandas as pd

data = [[0, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 1]]
index = ['Item1', 'Item2']
columns = ['20220130', '20220131', '20220201', '20220202', '20220203', '20220204']
df = pd.DataFrame(data, index=index, columns=columns)
window_size = 3
for row_index, row in df.iterrows():
    # Work on a copy of the row and write it back at the end: modifying the
    # row yielded by iterrows() does not reliably update the DataFrame.
    copy_row = row.copy()
    windowed_row = windowed_list(row, window_size, disjunct=False)
    for idx, window in enumerate(windowed_row):
        if sum([val == 0 for val in window]) == window_size:
            # idx is the position of the window's first element in the row
            copy_row.iloc[idx:idx + window_size] = [-1] * window_size
    df.loc[row_index] = copy_row  # overwrite the entire row
print(df)
Output:
       20220130  20220131  20220201  20220202  20220203  20220204
Item1        -1        -1        -1        -1         1         0
Item2         0         1        -1        -1        -1         1
I tested this approach with a couple of cases and it seems to work, but please tell me if you find a case where it doesn't.
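One way to look for failing cases is to compare the result against an independent reference. The following is only a sketch in plain Python built on itertools.groupby; the function name expected_row and the extra test rows are assumptions made for this illustration, not part of the answer above.

```python
from itertools import groupby
from typing import List


def expected_row(values: List[int], n: int = 3) -> List[int]:
    """Hypothetical reference: replace each run of n or more zeros with -1."""
    result: List[int] = []
    # groupby yields one group per run of consecutive equal values
    for value, run in groupby(values):
        run_values = list(run)
        if value == 0 and len(run_values) >= n:
            result.extend([-1] * len(run_values))
        else:
            result.extend(run_values)
    return result


cases = [
    [0, 0, 0, 0, 1, 0],     # first sample row
    [0, 1, 0, 0, 0, 1],     # second sample row
    [0, 0, 1, 0, 0, 0],     # run at the end of the row
    [0, 0],                 # row shorter than the window
    [1, 0, 0, 0, 0, 0, 0],  # run longer than two windows
]
for case in cases:
    print(case, '->', expected_row(case))
```

Each row of the DataFrame can then be checked with something like expected_row(list(row)) against the values the windowed approach produced.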
CodePudding user response:
You can stack, then use a custom groupby to group the successive identical values per ID, and get the size of the group. After unstack, use this as a mask:
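N = 3
s = df.rename_axis('index').stack()
# ge(N), not gt(N): runs of exactly N values must be caught as well
m = s.groupby(['index', s.ne(s.shift()).cumsum()]).transform('size').ge(N).unstack()
# keep only runs of zeros; long runs of other values are left alone
out = df.mask(m & df.eq(0), -1)
output:
       20220130  20220131  20220201  20220202  20220203  20220204
Item1        -1        -1        -1        -1         1         0
Item2         0         1        -1        -1        -1         1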
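The grouping key s.ne(s.shift()).cumsum() may be easier to follow on a single row. This is a small sketch of that one step, using the second sample row; the variable names starts, run_id and run_size are chosen here for illustration.

```python
import pandas as pd

row = pd.Series([0, 1, 0, 0, 0, 1])

# True wherever the value differs from the previous one, i.e. a new run starts
starts = row.ne(row.shift())
# The cumulative sum of those flags gives every element the id of its run
run_id = starts.cumsum()
# Size of the run that each element belongs to
run_size = row.groupby(run_id).transform('size')

print(run_id.tolist())    # [1, 2, 3, 3, 3, 4]
print(run_size.tolist())  # [1, 1, 3, 3, 3, 1]
```

Comparing run_size with N then marks every element that sits in a long enough run.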
CodePudding user response:
Another possible solution, based on the following ideas:
Convert each row to a string.
Use a regex to replace every run of three or more zeros with a run of nines of the same length.
Map the string back to a list of integers, turning each 9 into -1.
Assign the result back to the dataframe.
This assumes every value is a single digit and that 9 does not otherwise occur in the data.
import re

def repl(m):
    return '9' * len(m.group())

df.iloc[:] = pd.DataFrame.from_records(
    df.apply(lambda x:
             [*map(
                 lambda y: -1 if y == '9' else int(y),
                 [*re.sub(r"0{3}0*", repl, ''.join(map(str, x)))])],
             axis=1))
df
#>        20220130  20220131  20220201  20220202  20220203  20220204
#> Item1        -1        -1        -1        -1         1         0
#> Item2         0         1        -1        -1        -1         1
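To see the intermediate values, the same steps can be run on one row outside of pandas. This is only a sketch of the string round trip, using the second sample row; the names as_text, marked and restored are introduced here for illustration.

```python
import re


def repl(m: re.Match) -> str:
    # Replace the matched run of zeros with the same number of nines
    return '9' * len(m.group())


row = [0, 1, 0, 0, 0, 1]
as_text = ''.join(map(str, row))           # one character per value
marked = re.sub(r"0{3}0*", repl, as_text)  # runs of three or more zeros become nines
restored = [-1 if ch == '9' else int(ch) for ch in marked]

print(as_text)   # 010001
print(marked)    # 019991
print(restored)  # [0, 1, -1, -1, -1, 1]
```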