I have the following dataframe:
# initialize list of lists
data = [['1', "Tag1, Tag323, Tag36"], ['2', "Tag11, Tag212"], ['4', "Tag1, Tag12, Tag3, Tag324"]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['ID', 'Tag'])
print(df)
ID Tag
1 Tag1, Tag323, Tag36
2 Tag11, Tag212
4 Tag1, Tag12, Tag3, Tag324
I would like to manipulate the string values (e.g. "Tag1, Tag2, Tag3"
) in column tag
with the follwing condition. In each row, if there are more than 2 tags, the output should look like "Tag1, Tag2 .."
. The tag length can be different.
print(df)
ID Tag
1 Tag1, Tag323 ..
2 Tag11, Tag212
4 Tag1, Tag12 ..
Does anyone know a Pandas apply and lambda method to solve this?
CodePudding user response:
Idea is split tags by Series.str.split
and if length is greater like N
select first N
values and join:
N = 2
s = df['Tag'].str.split(",")
df['Tag'] = df['Tag'].mask(s.str.len().gt(N), s.str[:N].str.join(",") "...")
print (df)
ID Tag
0 1 Tag1, Tag323...
1 2 Tag11, Tag212
2 4 Tag1, Tag12...
Solution with apply
:
N = 2
df['Tag'] = df['Tag'].apply(lambda x: ', '.join(x.split(',')[:N]) '...'
if x.count(',') 1 > N else x)
Or:
N = 2
df['Tag'] = df['Tag'].mask(df['Tag'].str.count(',').gt(N),
df['Tag'].apply(lambda x: ', '.join(x.split(',')[:N]) '...'))
CodePudding user response:
len_ = len('Tag1, Tag2, Tag3')
df['Tag'] = [x if len(x)<len_ else "Tag1, Tag2 .." for x in df['Tag'] ]
You can choose your own length.
Output:
ID Tag
0 1 Tag1, Tag2 ..
1 2 Tag1, Tag2
2 4 Tag1, Tag2 ..
CodePudding user response:
I would do it following way
import re
import pandas as pd
data = [['1', "Tag1, Tag323, Tag36"], ['2', "Tag11, Tag212"], ['4', "Tag1, Tag12, Tag3, Tag324"]]
df = pd.DataFrame(data, columns = ['ID', 'Tag'])
def shorten(text):
return re.sub(r'(\S ,\s \S ),\s \S .*', r'\1..', text)
df['Tag'] = df['Tag'].apply(shorten)
print(df)
gives output
ID Tag
0 1 Tag1, Tag323..
1 2 Tag11, Tag212
2 4 Tag1, Tag12..
Explanation: I use re.sub
function with capturing group, meaning of symbol is as follows: (
...)
- capturing group \S
- non-whitespace character \s
- whitespace character
- one-or-more repetitions .
- zero or more repetitions .
- any character, ,
- literal comma. If found text does containg regular expression, i.e. at least 3 non-whitespace runs, then it is replace by two first runs and what is between them at two dots, other text remain unaffected.
CodePudding user response:
Just use this simple code:
for i in df.Tag:
a = i.split(",")
if len(a)>2:
b = a[0:2]
c = ",".join(b) ".."
df["Tag"] = c
CodePudding user response:
The ol’ split
and join
Methods might just work here. I’ll write the whole function below but this can be converted into a lambda function as well
def tag_splitter(text):
split_text = text.split(',')
if len(split_text) > 2:
return ','.join(split_text[:2]) '..'
return ','.join(split_text)
When finished, apply
this to your column (like df[‘tag’].apply(tag_splitter)
)
Also, the quotation marks appear wrong on my screen (but I can’t currently change them). Make sure to use the proper single quotes