I have the DataFrame below:
col1
GRE MET HOCK 38
ASS COM CORD EMERG INIT
I would like to create a column with the first letter of each word from col1, while keeping integers whole, as shown below:
col1                     col2
GRE MET HOCK 38          GMH38
ASS COM CORD EMERG INIT  ACCEI
I found something that might work, but it does not give the expected output:
import re
input = "GRE MET HOCK 38"
output = "".join(item[0].upper() for item in re.findall(r"\w+", input))
CodePudding user response:
Split the strings on whitespace, then stack the pieces into one long Series. Then you can take the first letter of each piece, except where the piece is numeric, and finally join the results and assign back, which aligns on the original DataFrame index.
import pandas as pd
df = pd.DataFrame({'col1': ['GRE MET HOCK 38', 'ASS COM CORD EMERG INIT']})
s = df['col1'].str.split(r'\s+', expand=True).stack()
df['col2'] = s.str[0].mask(s.str.isnumeric(), s).groupby(level=0).agg(''.join)
                      col1   col2
0          GRE MET HOCK 38  GMH38
1  ASS COM CORD EMERG INIT  ACCEI
CodePudding user response:
You can use Series.str.replace:
import pandas as pd
df = pd.DataFrame({'col':['GRE MET HOCK 38', 'ASS COM CORD EMERG INIT']})
df['col'].str.replace(r'\b(?!\d+\b)(\w)\w*|\s+', lambda x: x.group(1).upper() if x.group(1) else '', regex=True)
# => 0 GMH38
# 1 ACCEI
# Name: col, dtype: object
See the regex demo. Depending on what kind of numbers and what kind of word boundaries you need to support, the regex can be adjusted.
The current pattern matches
\b(?!\d+\b)(\w)\w*
- a word boundary, then one word char (captured into Group 1) followed by zero or more word chars, provided these word chars do not form a digit sequence as a whole
|
- or
\s+
- one or more whitespaces.
If Group 1 matches, its uppercased value is the replacement; otherwise the match is removed (the replacement is an empty string).
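The same pattern can also be tried on single strings with the standard `re` module, which may make the role of Group 1 easier to see. This is only a minimal sketch; the helper name `abbreviate` is made up for illustration and is not part of pandas or of the answer above.

```python
import re

# First alternative: a word that is not all digits; Group 1 holds its first character.
# Second alternative: a run of whitespace; Group 1 is unset, so the match is dropped.
# All-digit tokens match neither alternative and are left untouched.
PATTERN = re.compile(r'\b(?!\d+\b)(\w)\w*|\s+')

def abbreviate(text):
    # Hypothetical helper used only for this illustration.
    return PATTERN.sub(lambda m: m.group(1).upper() if m.group(1) else '', text)

print(abbreviate('GRE MET HOCK 38'))          # GMH38
print(abbreviate('ASS COM CORD EMERG INIT'))  # ACCEI
```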
CodePudding user response:
You can iterate over the column items, split each item on spaces to get its words, build a new string from the first letter of each word, and collect these strings in a list. Then add the list to the DataFrame as a new column.
from pandas import DataFrame
data = {
    'col1': ['GRE MET HOCK 38', 'ASS COM CORD EMERG INIT'],
}
new_column = []
df = DataFrame(data)
for item in df['col1']:
    new_item = ""
    # extract the words from the item by splitting on whitespace
    words = item.split()
    for word in words:
        # append the first letter of the word to the new item
        new_item += word[0]
    # add the new item to the new column
    new_column.append(new_item)
# add the new column to the DataFrame
df['col2'] = new_column
print(df)
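Note that a loop which takes the first character of every word also shortens a numeric token such as 38 to 3. Below is a minimal sketch of a variant that keeps all-digit tokens whole, assuming that "integer" means a token for which `str.isdigit()` is true. The function name `first_letters` is hypothetical, and plain lists are used so the sketch runs without pandas.

```python
def first_letters(text):
    # Hypothetical helper: first letter of each word, all-digit words kept whole.
    parts = []
    for word in text.split():
        parts.append(word if word.isdigit() else word[0].upper())
    return "".join(parts)

rows = ['GRE MET HOCK 38', 'ASS COM CORD EMERG INIT']
new_column = [first_letters(item) for item in rows]
print(new_column)  # ['GMH38', 'ACCEI']
```

With a DataFrame, the same function could be applied as df['col2'] = df['col1'].apply(first_letters).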