I have a below Dataframe
Col1 Col2 Col3
A ABC 100
B BCD 200
C CDA 300
D CDA 400
E CDA 500
F EFG 600
G XYZ 700
H XYZ 800
I PQR 900
We can see that in Col2 CDA is repeated 3 times and XYZ is repeated 2 times.
Now I need create the new column NewCol4 and check for the consecutive values, if there are consecutive values just add underscore at the last and add the number after underscore, if it is not consecutive value leave as it is.
I need the output in the below format.
Col1 Col2 Col3 NewCol4
A ABC 100 ABC
B BCD 200 BCD
C CDA 300 CDA_1
D CDA 400 CDA_2
E CDA 500 CDA_3
F EFG 600 EFG
G XYZ 700 XYZ_1
H XYZ 800 XYZ_2
I PQR 900 PQR
I am very new to Pandas, is there any way to achieve the above output that it would be really helpful.
CodePudding user response:
Pretty sure that my answer is not the best idea and there should be some single-line approaches as well, but you can use:
checkNumber = [0]
checkValue = [df["Col2"].values[0]]
def toApply(value):
if sum(df["Col2"] == value) > 1:
if checkValue[0] == value:
checkNumber[0] = 1
return value "_" str(checkNumber[0])
else:
checkNumber[0] = 0
checkValue[0] = value
print(value)
return toApply(value)
else:
return value
df["NewCol4"] = df["Col2"].apply(toApply)
df
Output
Col1 | Col2 | Col3 | NewCol4 | |
---|---|---|---|---|
0 | A | ABC | 100 | ABC |
1 | B | BCD | 200 | BCD |
2 | C | CDA | 300 | CDA_1 |
3 | D | CDA | 400 | CDA_2 |
4 | E | CDA | 500 | CDA_3 |
5 | F | EFG | 600 | EFG |
6 | G | XYZ | 700 | XYZ_1 |
7 | H | XYZ | 800 | XYZ_2 |
8 | I | PQR | 900 | PQR |
CodePudding user response:
A couple of "one-liners" together (?) can bring you all the way, but surely this could be optimized:
# Assign a number to each consecutive group of equal "Col2" values
df["g_rank"] = (df["Col2"] != df["Col2"].shift()).cumsum()
# Add a column with the size of each group
df = df.join(df.groupby("g_rank")["Col2"].size().rename("size"), on="g_rank")
# Now add a different number to each member of a group
df["l_rank"] = df.groupby("g_rank")["size"].rank("first")
# Finally, calculate all the "NewCol4" and remove the unnecesary ones
df["NewCol4"] = df["Col2"] "_" df["l_rank"].astype(int).astype(str)
df.loc[df["size"] <= 1, "NewCol4"] = None
I left some columns on purpose for make it easier to understand each step. This should be the output:
Col1 Col2 Col3 g_rank size l_rank NewCol4
0 A ABC 100 1 1 1.0 None
1 B BCD 200 2 1 1.0 None
2 C CDA 300 3 3 1.0 CDA_1
3 D CDA 400 3 3 2.0 CDA_2
4 E CDA 500 3 3 3.0 CDA_3
5 F EFG 600 4 1 1.0 None
6 G XYZ 700 5 2 1.0 XYZ_1
7 H XYZ 800 5 2 2.0 XYZ_2
8 I PQR 900 6 1 1.0 None
CodePudding user response:
Maybe something like this could help you
# get repeating values
counts = df['Col2'].value_counts()
multi = counts.index[counts>1]
# Get a counter for each Col2 value
new = df.groupby('Col2').apply(pd.DataFrame.reset_index)
new = new.reset_index(1)
new['level_1'] = 1
# Build a NewCol4 with a suffix for all
new['NewCol4'] = new['Col2'] "_" new['level_1'].astype(str)
# Undo those not needed
stay = ~new['Col2'].isin(multi)
new.loc[stay, "NewCol4"] = new.loc[stay, "Col2"]
# remove unneeded columns
new.set_index('index', drop=True, inplace=True)
new.pop('level_1')
new
This is not yet very optimized, but could help as a starting point.
CodePudding user response:
use a groupby and the df.loc to create a NewCol4 . The groupby results can have an apply with function used to return dataframes index hierarchy and list of list values of the groupby resultset.
data="""Col1 Col2 Col3 NewCol4
A ABC 100 ABC
B BCD 200 BCD
C CDA 300 CDA_1
D CDA 400 CDA_2
E CDA 500 CDA_3
F EFG 600 EFG
G XYZ 700 XYZ_1
H XYZ 800 XYZ_2
I PQR 900 PQR"""
df = pd.read_csv(StringIO(data), sep="\s ")
grouped=df.groupby('Col2')['Col3']
index=[]
values=[]
def count_consecutive(df):
index.append(df.index)
values.append(df.values)
grouped.apply(count_consecutive)
#[print(x) for x in index]
#[print(x) for x in values]
for x in index:
count=0
old_value=0
for i in x:
field=df.loc[i,'Col2']
value=df.loc[i,'Col3']
#print(value)
if value>old_value:
count =1
df.loc[i,'NewCol4']=field "_" str(count)
old_value=value
print(df)
output:
Col1 Col2 Col3 NewCol4
0 A ABC 100 ABC_1
1 B BCD 200 BCD_1
2 C CDA 300 CDA_1
3 D CDA 400 CDA_2
4 E CDA 500 CDA_3
5 F EFG 600 EFG_1
6 G XYZ 700 XYZ_1
7 H XYZ 800 XYZ_2
8 I PQR 900 PQR_1
CodePudding user response:
Here is another "one-liner" version:
# Create NewCol4 by adding increments to col2 for those groups that have more than one element
df2 = (df.assign(NewCol4 = df['Col2'] '_' (df.groupby('Col2').cumcount() 1).astype(str))
.groupby('Col2').filter(lambda g:len(g)>1)
)
# Join to the original. This will leave NaNs in NewCol4 for single-element groups
df3 = df.join(df2['NewCol4'])
# Fill NaNa from Col2
df3['NewCol4'] = df3['NewCol4'].fillna(df['Col2'])
df3
now looks like this
Col1 Col2 Col3 NewCol4
-- ------ ------ ------ ---------
0 A ABC 100 ABC
1 B BCD 200 BCD
2 C CDA 300 CDA_1
3 D CDA 400 CDA_2
4 E CDA 500 CDA_3
5 F EFG 600 EFG
6 G XYZ 700 XYZ_1
7 H XYZ 800 XYZ_2
8 I PQR 900 PQR