Extract first sequence of strings in pandas column

Time:11-25

I have a column in a DF as below

| Column A       |
| ab, bce, bc    |
| bc, abcd, ab   | 
| ab, cd, abc    | 

and I want to create a new column that takes only the first sequence, as shown below

| Column A       | Column B |
| ab, bce, bc    | ab       |
| bc, abcd, ab   | bc       |
| ab, cd, abc    | ab       |

I tried this code, but it only gives me the first letter of the first sequence, not the entire abbreviation

df.loc[:, 'ColumnB'] = df.ColumnA.map(lambda x: x[0])

CodePudding user response:

I guess the items in ColumnA are strings such as 'ab, bce, bc', so just use split ;).

df.loc[:, 'ColumnB'] = df.ColumnA.map(lambda x: x.split(',')[0])

CodePudding user response:

You can also try the vectorised str method split and use integer indexing on the resulting list to get the first element:

df['Column B'] = df['Column A'].str.split(',').str[0]

This should give

Column A       Column B 
ab, bce, bc    ab       
bc, abcd, ab   bc       
ab, cd, abc    ab       
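
As a minimal sketch of how the vectorised approach behaves on less tidy data: the sample frame below is made up for illustration (a missing value and a padded first item are assumptions, not part of the question). The `.str` methods skip missing values instead of raising, `n=1` stops splitting after the first comma, and `.str.strip()` removes surrounding whitespace.

```python
import pandas as pd

# Hypothetical data: one missing value and one first item padded with spaces.
df = pd.DataFrame({"Column A": ["ab, bce, bc", None, " cd , ab"]})

# Split at most once, take the first piece, then trim whitespace.
# Missing values stay missing instead of raising an error.
df["Column B"] = df["Column A"].str.split(",", n=1).str[0].str.strip()

print(df)
```

A plain `map(lambda x: x.split(',')[0])` would raise an AttributeError on the missing row, because the missing value is not a string.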

CodePudding user response:

You're close, you just need to convert the strings to lists with pandas.Series.str.split before the map:

df["Column B"]= df["Column A"].str.split(",").map(lambda x: x[0])

You can also use pandas.Series.str.get:

df["Column B"]= df["Column A"].str.split(",").str.get(0)

Another option is list comprehension:

df["Column B"]= [el[0] for el in df["Column A"].str.split(",")]

# Output :

print(df)

       Column A Column B
0   ab, bce, bc       ab
1  bc, abcd, ab       bc
2   ab, cd, abc       ab

CodePudding user response:

Each cell is treated as a single string, so x[0] returns the first character of the string "ab, bce, bc" rather than its first item.

You need to convert that to a list and then take the first element which will be "ab" now.

df.loc[:, 'ColumnB'] = df.ColumnA.map(lambda x: x.split(",")[0])

This creates "ColumnB" as you require.
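
To make the difference concrete, here is a small sketch that rebuilds the question's data (the frame construction and the names `first_char` and `first_item` are illustrative assumptions) and compares indexing the string directly with indexing the list produced by split:

```python
import pandas as pd

# Rebuild the sample column from the question.
df = pd.DataFrame({"ColumnA": ["ab, bce, bc", "bc, abcd, ab", "ab, cd, abc"]})

# Indexing the string itself returns its first character.
first_char = df["ColumnA"].map(lambda x: x[0])

# Splitting on the comma first returns the first item of the list.
first_item = df["ColumnA"].map(lambda x: x.split(",")[0])

print(first_char.tolist())  # ['a', 'b', 'a']
print(first_item.tolist())  # ['ab', 'bc', 'ab']
```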

Hope it helps!

CodePudding user response:

If you want only the first chunk, don't split. Instead, extract the leading run of non-comma characters. This avoids building an intermediate list for every row, so it is typically more efficient:

df['Column B'] = df['Column A'].str.extract(r'([^,]+)', expand=False)

Output:

       Column A Column B
0   ab, bce, bc       ab
1  bc, abcd, ab       bc
2   ab, cd, abc       ab
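
A self-contained sketch of the extract approach, assuming the sample data from the question: the pattern `([^,]+)` captures one or more characters that are not a comma, starting from the first such character, and `expand=False` makes `str.extract` return a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Rebuild the sample column from the question.
df = pd.DataFrame({"Column A": ["ab, bce, bc", "bc, abcd, ab", "ab, cd, abc"]})

# Capture the first run of non-comma characters in each string.
# expand=False returns a Series because there is a single capture group.
df["Column B"] = df["Column A"].str.extract(r"([^,]+)", expand=False)

print(df["Column B"].tolist())  # ['ab', 'bc', 'ab']
```

Note that the captured text keeps any whitespace around the first item; chaining `.str.strip()` removes it if the data is padded.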