How to limit strings in Pandas row by row using apply and lambda?

Time:11-10

I have the following dataframe:

# initialize list of lists
data = [['1', "Tag1, Tag323, Tag36"], ['2', "Tag11, Tag212"], ['4', "Tag1, Tag12, Tag3, Tag324"]]
 
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['ID', 'Tag'])

print(df)

    ID  Tag
    1   Tag1, Tag323, Tag36
    2   Tag11, Tag212
    4   Tag1, Tag12, Tag3, Tag324

I would like to manipulate the string values (e.g. "Tag1, Tag2, Tag3") in the column Tag with the following condition: in each row, if there are more than 2 tags, the output should look like "Tag1, Tag2 ..". The tag lengths can differ.

print(df)

    ID  Tag
    1   Tag1, Tag323 ..
    2   Tag11, Tag212
    4   Tag1, Tag12 .. 

Does anyone know a Pandas apply and lambda method to solve this?

CodePudding user response:

The idea is to split the tags with Series.str.split and, if the number of tags is greater than N, select the first N values and join them:

N = 2
s = df['Tag'].str.split(",")
df['Tag'] = df['Tag'].mask(s.str.len().gt(N), s.str[:N].str.join(",") + "...")
print(df)
  ID              Tag
0  1  Tag1, Tag323...
1  2    Tag11, Tag212
2  4   Tag1, Tag12...

Solution with apply:

N = 2
df['Tag'] = df['Tag'].apply(lambda x: ','.join(x.split(',')[:N]) + '...'
                            if x.count(',') + 1 > N else x)

Or:

N = 2
df['Tag'] = df['Tag'].mask(df['Tag'].str.count(',').gt(N - 1),
                           df['Tag'].apply(lambda x: ','.join(x.split(',')[:N]) + '...'))

CodePudding user response:

len_ = len('Tag1, Tag2, Tag3')
df['Tag'] = [x if len(x) < len_ else "Tag1, Tag2 .." for x in df['Tag']]

You can choose your own length.

Output:

    ID  Tag
0   1   Tag1, Tag2 ..
1   2   Tag1, Tag2
2   4   Tag1, Tag2 ..
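If the intent is to truncate each row's own string at the chosen length, rather than substitute the fixed placeholder above, a minimal sketch might be (the rsplit-based cut back to the last full tag is my own assumption, not part of the original answer):

```python
import pandas as pd

data = [['1', "Tag1, Tag323, Tag36"], ['2', "Tag11, Tag212"],
        ['4', "Tag1, Tag12, Tag3, Tag324"]]
df = pd.DataFrame(data, columns=['ID', 'Tag'])

len_ = len('Tag1, Tag2, Tag3')  # character cutoff, chosen to fit roughly two tags
# Keep short strings; cut long ones at the cutoff, then drop the trailing
# partial tag by splitting off everything after the last comma
df['Tag'] = [x if len(x) < len_ else x[:len_].rsplit(',', 1)[0] + ' ..'
             for x in df['Tag']]
print(df['Tag'].tolist())
# ['Tag1, Tag323 ..', 'Tag11, Tag212', 'Tag1, Tag12 ..']
```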

CodePudding user response:

I would do it following way

import re
import pandas as pd
data = [['1', "Tag1, Tag323, Tag36"], ['2', "Tag11, Tag212"], ['4', "Tag1, Tag12, Tag3, Tag324"]] 
df = pd.DataFrame(data, columns = ['ID', 'Tag'])
def shorten(text):
    return re.sub(r'(\S+,\s+\S+),\s+\S+.*', r'\1..', text)
df['Tag'] = df['Tag'].apply(shorten)
print(df)

gives output

  ID             Tag
0  1  Tag1, Tag323..
1  2   Tag11, Tag212
2  4   Tag1, Tag12..

Explanation: I use the re.sub function with a capturing group. The symbols mean the following: (...) is a capturing group, \S is a non-whitespace character, \s is a whitespace character, + means one-or-more repetitions, .* means zero-or-more of any character, and , is a literal comma. If the text matches the regular expression, i.e. it contains at least 3 non-whitespace runs, it is replaced by the first two runs (and what is between them) followed by two dots; other text remains unaffected.
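The pattern's behavior can be checked on single strings before applying it to the column:

```python
import re

pattern = r'(\S+,\s+\S+),\s+\S+.*'
# Three or more runs: keep the first two, append ".."
print(re.sub(pattern, r'\1..', "Tag1, Tag12, Tag3, Tag324"))  # Tag1, Tag12..
# Only two runs: no match, so the string is returned unchanged
print(re.sub(pattern, r'\1..', "Tag11, Tag212"))              # Tag11, Tag212
```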

CodePudding user response:

Just use this simple code:

tags = []
for i in df.Tag:
    a = i.split(",")
    if len(a) > 2:
        tags.append(",".join(a[0:2]) + "..")
    else:
        tags.append(i)
df["Tag"] = tags

CodePudding user response:

The ol' split and join methods might just work here. I'll write the whole function below, but this can be converted into a lambda function as well:

def tag_splitter(text):
    split_text = text.split(',')
    if len(split_text) > 2:
        return ','.join(split_text[:2]) + '..'
    return ','.join(split_text)

When finished, apply this to your column (like df['Tag'].apply(tag_splitter)).
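One possible lambda form of the same split/join logic (the sample frame and column name are assumed from the question):

```python
import pandas as pd

data = [['1', "Tag1, Tag323, Tag36"], ['2', "Tag11, Tag212"],
        ['4', "Tag1, Tag12, Tag3, Tag324"]]
df = pd.DataFrame(data, columns=['ID', 'Tag'])

# Same logic as tag_splitter, inlined as a lambda: keep the first two
# comma-separated tags and append ".." only when more than two exist
df['Tag'] = df['Tag'].apply(
    lambda t: ','.join(t.split(',')[:2]) + '..' if len(t.split(',')) > 2 else t)
print(df['Tag'].tolist())
# ['Tag1, Tag323..', 'Tag11, Tag212', 'Tag1, Tag12..']
```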

