How to Find the consecutive values in Pandas DataFrame column and replace the value-CodePudding

I have a below Dataframe

Col1    Col2    Col3
A       ABC     100
B       BCD     200
C       CDA     300
D       CDA     400
E       CDA     500
F       EFG     600
G       XYZ     700
H       XYZ     800
I       PQR     900

We can see that in Col2 CDA is repeated 3 times and XYZ is repeated 2 times.

Now I need create the new column NewCol4 and check for the consecutive values, if there are consecutive values just add underscore at the last and add the number after underscore, if it is not consecutive value leave as it is.

I need the output in the below format.

Col1    Col2    Col3   NewCol4
A       ABC     100    ABC
B       BCD     200    BCD
C       CDA     300    CDA_1
D       CDA     400    CDA_2
E       CDA     500    CDA_3
F       EFG     600    EFG
G       XYZ     700    XYZ_1
H       XYZ     800    XYZ_2
I       PQR     900    PQR

I am very new to Pandas, is there any way to achieve the above output that it would be really helpful.

CodePudding user response：

Pretty sure that my answer is not the best idea and there should be some single-line approaches as well, but you can use:

checkNumber = [0]
checkValue = [df["Col2"].values[0]]
def toApply(value):
  if sum(df["Col2"] == value) > 1:
    if checkValue[0] == value:
      checkNumber[0]  = 1
      return value   "_"   str(checkNumber[0])
    else:
      checkNumber[0] = 0
      checkValue[0] = value
      print(value)
      return toApply(value)
  else:
    return value
df["NewCol4"] = df["Col2"].apply(toApply)
df

Output

	Col1	Col2	Col3	NewCol4
0	A	ABC	100	ABC
1	B	BCD	200	BCD
2	C	CDA	300	CDA_1
3	D	CDA	400	CDA_2
4	E	CDA	500	CDA_3
5	F	EFG	600	EFG
6	G	XYZ	700	XYZ_1
7	H	XYZ	800	XYZ_2
8	I	PQR	900	PQR

CodePudding user response：

A couple of "one-liners" together (?) can bring you all the way, but surely this could be optimized:

# Assign a number to each consecutive group of equal "Col2" values
df["g_rank"] = (df["Col2"] != df["Col2"].shift()).cumsum()

# Add a column with the size of each group
df = df.join(df.groupby("g_rank")["Col2"].size().rename("size"), on="g_rank")

# Now add a different number to each member of a group
df["l_rank"] = df.groupby("g_rank")["size"].rank("first")

# Finally, calculate all the "NewCol4" and remove the unnecesary ones
df["NewCol4"] = df["Col2"]   "_"   df["l_rank"].astype(int).astype(str)
df.loc[df["size"] <= 1, "NewCol4"] = None

I left some columns on purpose for make it easier to understand each step. This should be the output:

  Col1 Col2  Col3  g_rank  size  l_rank NewCol4
0    A  ABC   100       1     1     1.0    None
1    B  BCD   200       2     1     1.0    None
2    C  CDA   300       3     3     1.0   CDA_1
3    D  CDA   400       3     3     2.0   CDA_2
4    E  CDA   500       3     3     3.0   CDA_3
5    F  EFG   600       4     1     1.0    None
6    G  XYZ   700       5     2     1.0   XYZ_1
7    H  XYZ   800       5     2     2.0   XYZ_2
8    I  PQR   900       6     1     1.0    None

CodePudding user response：

Maybe something like this could help you

# get repeating values
counts = df['Col2'].value_counts()
multi = counts.index[counts>1]
# Get a counter for each Col2 value
new = df.groupby('Col2').apply(pd.DataFrame.reset_index)
new = new.reset_index(1)
new['level_1']  = 1
# Build a NewCol4 with a suffix for all
new['NewCol4'] = new['Col2']   "_"   new['level_1'].astype(str)
# Undo those not needed
stay = ~new['Col2'].isin(multi)
new.loc[stay, "NewCol4"] = new.loc[stay, "Col2"]
# remove unneeded columns
new.set_index('index', drop=True, inplace=True)
new.pop('level_1')
new

This is not yet very optimized, but could help as a starting point.

CodePudding user response：

use a groupby and the df.loc to create a NewCol4 . The groupby results can have an apply with function used to return dataframes index hierarchy and list of list values of the groupby resultset.

data="""Col1    Col2    Col3   NewCol4
A       ABC     100    ABC
B       BCD     200    BCD
C       CDA     300    CDA_1
D       CDA     400    CDA_2
E       CDA     500    CDA_3
F       EFG     600    EFG
G       XYZ     700    XYZ_1
H       XYZ     800    XYZ_2
I       PQR     900    PQR"""

df = pd.read_csv(StringIO(data), sep="\s ")

grouped=df.groupby('Col2')['Col3']

index=[]
values=[]
def count_consecutive(df):
    index.append(df.index)
    values.append(df.values)

grouped.apply(count_consecutive)        
#[print(x) for x in index]
#[print(x) for x in values]

for x in index:
    count=0
    old_value=0
    for i in x:
        field=df.loc[i,'Col2']
        value=df.loc[i,'Col3']
        #print(value)
        if value>old_value:
            count =1
        df.loc[i,'NewCol4']=field "_" str(count)
        old_value=value
    
print(df)

output:

Col1 Col2  Col3 NewCol4
0    A  ABC   100   ABC_1
1    B  BCD   200   BCD_1
2    C  CDA   300   CDA_1
3    D  CDA   400   CDA_2
4    E  CDA   500   CDA_3
5    F  EFG   600   EFG_1
6    G  XYZ   700   XYZ_1
7    H  XYZ   800   XYZ_2
8    I  PQR   900   PQR_1

CodePudding user response：

Here is another "one-liner" version:

# Create NewCol4 by adding increments to col2 for those groups that have more than one element
df2 = (df.assign(NewCol4 = df['Col2']   '_'   (df.groupby('Col2').cumcount() 1).astype(str))
         .groupby('Col2').filter(lambda g:len(g)>1)
)

# Join to the original. This will leave NaNs in NewCol4 for single-element groups
df3 = df.join(df2['NewCol4'])

# Fill NaNa from Col2
df3['NewCol4'] = df3['NewCol4'].fillna(df['Col2'])

df3 now looks like this

    Col1    Col2      Col3  NewCol4
--  ------  ------  ------  ---------
 0  A       ABC        100  ABC
 1  B       BCD        200  BCD
 2  C       CDA        300  CDA_1
 3  D       CDA        400  CDA_2
 4  E       CDA        500  CDA_3
 5  F       EFG        600  EFG
 6  G       XYZ        700  XYZ_1
 7  H       XYZ        800  XYZ_2
 8  I       PQR        900  PQR