Python pandas create new column with string code based on boolean rows

I have a dataframe with multiple columns containing booleans/ints (1/0). I need a new result column whose strings encode, per row: how many groups of consecutive Trues there are, whether the chain is interrupted or not, and from which column to which column each group of Trues runs.

For example, take the following dataframe:

    column_1  column_2  column_3  column_4  column_5  column_6  column_7  column_8  column_9  column_10 
0          0         1         0         1         1         1         1         0         0          1
1          0         1         1         0         1         1         1         0         0          1
2          1         1         0         0         0         1         1         0         0          1
3          1         1         1         0         0         0         0         1         1          1
4          1         1         1         0         0         1         0         0         1          1
5          1         1         1         0         0         0         1         1         0          1
6          0         1         1         1         1         1         1         0         1          0

Take the following row, for example: 1: [0 1 1 0 1 1 1 0 0 1]

It would result in this code string in column_result: i2/2-3/c2-c3_c5-c7/6, which is built from four segments that I can read somewhere in my code later.

Segment 1:

'i' stands for interrupted; if not interrupted, it would be 'c' for consecutive.
The 2 stands for how many groups of 2 or more consecutive Trues were found.

Segment 2:

The length of each consecutive group; in this case the first group has length 2 and the second has length 3.

Segment 3:

For each group of consecutive Trues, the number/id of the column where its first True was found and the number of the column where its last True was found.

Segment 4:

Just the total count of Trues in the row.

Another example is row 6: [0 1 1 1 1 1 1 0 1 0], which would result in this code string in column_result: c1/6/c2-c7/7
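The four segments described above can be sketched for a single row as follows. This is only an illustration, not a definitive implementation: the function name `encode_row` and the `min_run` parameter are hypothetical, and the rule for 'i' versus 'c' (more than one group of 2 or more Trues means interrupted) is an assumption inferred from the two examples.

```python
from itertools import groupby


def encode_row(values, min_run=2):
    """Build the four-segment code string for one row of 0/1 values."""
    runs = []  # (first_column, last_column, length), columns counted from 1
    position = 1
    for is_true, group in groupby(values, key=bool):
        length = len(list(group))
        if is_true and length >= min_run:
            runs.append((position, position + length - 1, length))
        position += length

    # Assumption: 'i' when more than one qualifying group exists, else 'c'.
    prefix = "i" if len(runs) > 1 else "c"
    segment_1 = f"{prefix}{len(runs)}"
    segment_2 = "-".join(str(length) for _, _, length in runs)
    segment_3 = "_".join(f"c{first}-c{last}" for first, last, _ in runs)
    segment_4 = str(sum(1 for v in values if v))
    return "/".join([segment_1, segment_2, segment_3, segment_4])


print(encode_row([0, 1, 1, 0, 1, 1, 1, 0, 0, 1]))  # i2/2-3/c2-c3_c5-c7/6
print(encode_row([0, 1, 1, 1, 1, 1, 1, 0, 1, 0]))  # c1/6/c2-c7/7
```

A row without any group of 2 or more Trues gives empty middle segments under this sketch, for example c0///1 for a row with a single True.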

Below is the starter code I used to create the above dataframe, which has random ints for bools:

import numpy as np
import pandas as pd

def create_custom_result(df: pd.DataFrame) -> pd.Series:
    return df  # placeholder: this should return the code strings

def create_dataframe() -> pd.DataFrame:
    df = pd.DataFrame()  # empty df

    for i in range(1, 11):   # create random bool/int values
        df[f'column_{i}'] = np.random.randint(2, size=50)

    df["column_result"] = ''    # add result column
    return df

if __name__=="__main__":

    df = create_dataframe()
    custom_results = create_custom_result(df=df)

Would someone have any idea how to tackle this? To be honest, I have no idea where to start. The closest thing I found was "count sets of consecutive true values in a column"; however, it works down a column and not horizontally across the rows. Maybe someone can tell me whether I should try NumPy arrays, or whether pandas has some function that can help me? I found some groupby functions that work horizontally, but I wouldn't know how to convert their output to the string code to be used in the result column. Or should I loop through the DataFrame by rows and then build the column_result code in segments?

Thanks in advance!

I tried some things already, looping through the dataframe row by row, but had no idea how to build a new column with the code strings.

I also found this article: pandas groupby .. but wouldn't know how to create new string column data from the groups I found. Also, almost everything I find groups within a single column and not across the rows of all columns.
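As a hedged illustration of the "NumPy array" route combined with a plain loop over the rows, the sketch below finds the runs of 1s in each row with `np.diff` and writes the code string into column_result. The helper names `find_runs` and `build_code` are hypothetical, the two-row frame is taken from the examples above, and the 'i'/'c' rule (more than one group of 2 or more Trues means interrupted) is an assumption.

```python
import numpy as np
import pandas as pd


def find_runs(row, min_run=2):
    """Return (first_col, last_col, length) per run of 1s, columns counted from 1."""
    padded = np.concatenate(([0], np.asarray(row, dtype=int), [0]))
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1) + 1  # first column of each run
    ends = np.flatnonzero(edges == -1)       # last column of each run
    lengths = ends - starts + 1
    keep = lengths >= min_run
    return list(zip(starts[keep].tolist(), ends[keep].tolist(), lengths[keep].tolist()))


def build_code(row):
    """Join the four segments for one row of 0/1 values."""
    runs = find_runs(row)
    prefix = "i" if len(runs) > 1 else "c"  # assumed meaning of 'interrupted'
    return "/".join([
        f"{prefix}{len(runs)}",
        "-".join(str(n) for _, _, n in runs),
        "_".join(f"c{a}-c{b}" for a, b, _ in runs),
        str(int(np.sum(row))),
    ])


df = pd.DataFrame(
    [[0, 1, 1, 0, 1, 1, 1, 0, 0, 1],
     [0, 1, 1, 1, 1, 1, 1, 0, 1, 0]],
    columns=[f"column_{i}" for i in range(1, 11)],
)
value_columns = [c for c in df.columns if c != "column_result"]
df["column_result"] = [build_code(row) for row in df[value_columns].to_numpy()]
print(df["column_result"].tolist())
```

Selecting `value_columns` first keeps an existing column_result column out of the calculation, which matters when the frame comes from the starter code.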

CodePudding user response：

Maybe this code works?

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,2, size=(12,8)))
df.columns=["col1","col2","col3","col4","col5","col6","col7","col8"]

def func(df:pd.DataFrame) -> pd.DataFrame:
    result_list = []
    copy = df.copy()
    cumsum = copy.cumsum(axis=1)

    for r,s in cumsum.iterrows():    
        count = 0
        last = -1
        interrupted = 0
        consecutive = 0
        consecutives = []    
        ranges = []   

        for x in s.values:
            count += 1
            if x != 0:
                if x!=last:
                    consecutive += 1
                    last = x            
                    if consecutive == 2:
                        ranges.append(count-1)
                elif x==last:
                    if consecutive > 1:
                        interrupted += 1
                        ranges.append(count-1) 
                        consecutives.append(str(consecutive))
                    consecutive = 0
        else:  # for-else: runs once after the loop, closing a run that reaches the last column
            if consecutive > 1:
                consecutives.append(str(consecutive))
                ranges.append(count)                

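        # s.values[-1] is the row's final cumsum, i.e. its total number of Trues (0 for an all-zero row)
        result = f'{interrupted}i/{len(consecutives)}c/{"-".join(consecutives)}/{"_".join([ f"c{ranges[i]}-c{ranges[i+1]}" for i in range(0,len(ranges),2) ])}/{s.values[-1]}'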
        result_list.append(result.split("/"))

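    # pass index=copy.index so the new columns also align when df has a non-default index
    copy["results"] = pd.Series(["/".join(i) for i in result_list], index=copy.index)
    copy[["interrupts_count","consecutives_count","consecutives lengths","consecutives columns ranges","total"]] = pd.DataFrame(np.array(result_list), index=copy.index)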
    return copy

result_df = func(df)
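
The function above relies on the row-wise cumulative sum: the running total goes up on every 1 and stays flat on every 0, so a repeated value marks a break in a run. A minimal sketch on the example row from the question (the variable names here are only illustrative):

```python
import numpy as np

row = np.array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1])
cumsum = row.cumsum()
print(cumsum.tolist())  # [0, 1, 2, 2, 3, 4, 5, 5, 5, 6]

# A position holds a 1 exactly where the running total increased.
increased = np.diff(cumsum, prepend=0) > 0
print(increased.astype(int).tolist())  # [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]

# The last cumulative value is the total number of Trues in the row.
print(int(cumsum[-1]))  # 6
```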

CodePudding user response：

Maybe go with a simple class for each column that receives a series from the original DataFrame (i.e. sliced vertically) and the new value. From the vertically sliced array, calculate all starting values as fields (start of the consecutive True values, length of the consecutive True values, last value, ...). Finally, use the starting values and the next new value to update the fields and prepare the string output.
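
One possible reading of this suggestion is a small stateful object that is fed one value at a time, keeps the run start, run list and total as fields, and can produce the string at any point. The class below is a hypothetical sketch of that idea for a single row, not the answerer's actual code; the name `RunTracker` and the 'i'/'c' rule (more than one group of 2 or more Trues means interrupted) are assumptions.

```python
class RunTracker:
    """Hypothetical helper that tracks runs of truthy values fed in one at a time."""

    def __init__(self, min_run=2):
        self.min_run = min_run
        self.position = 0      # 1-based column number of the last value seen
        self.run_start = None  # column where the current run began, if one is open
        self.runs = []         # closed runs as (first_column, last_column)
        self.total = 0         # number of truthy values seen so far

    def update(self, value):
        """Feed the next column's value and update the fields."""
        self.position += 1
        if value:
            self.total += 1
            if self.run_start is None:
                self.run_start = self.position
        elif self.run_start is not None:
            self.runs.append((self.run_start, self.position - 1))
            self.run_start = None

    def _qualifying_runs(self):
        """Closed runs plus the still-open run, keeping only those long enough."""
        runs = list(self.runs)
        if self.run_start is not None:
            runs.append((self.run_start, self.position))
        return [(a, b, b - a + 1) for a, b in runs if b - a + 1 >= self.min_run]

    def code(self):
        """Code string for the values seen so far."""
        runs = self._qualifying_runs()
        prefix = "i" if len(runs) > 1 else "c"  # assumed meaning of 'interrupted'
        return "/".join([
            f"{prefix}{len(runs)}",
            "-".join(str(n) for _, _, n in runs),
            "_".join(f"c{a}-c{b}" for a, b, _ in runs),
            str(self.total),
        ])


tracker = RunTracker()
for value in [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]:
    tracker.update(value)
print(tracker.code())  # i2/2-3/c2-c3_c5-c7/6
```

Because `code()` does not modify the fields, it can be called after every update, which suits the case where new values keep arriving.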