Extract specific words from a dataframe

The first dataframe I have is the following:

String1
Table 671usa50452.tab has been created as of the process date (12-19-22).
Table 643usa50552.tab has been created as of the process date (12-19-22).
Table 681usa50532.tab has been created as of the process date (12-19-22).
Table 621usa56452.tab has been created as of the process date (12-19-22).
Table 547usa67452.tab has been created as of the process date (12-19-22).

I would like to extract all the accounts that contain 'usa' in between and the date specified in each row to have something like this:

String1	Account	Date
Table 671usa50452.tab has been created as of the process date (12-19-22).	671usa50452	12-19-22
Table 643usa50552.tab has been created as of the process date (12-19-22).	643usa50552	12-19-22
Table 681usa50532.tab has been created as of the process date (12-19-22).	681usa50532	12-19-22
Table 621usa56452.tab has been created as of the process date (12-19-22).	621usa56452	12-19-22
Table 547usa67452.tab has been created as of the process date (12-19-22).	547usa67452	12-19-22

I've been trying to use the following, but the information is not getting into the columns of my new dataframe:

df_list1[['Account', 'Date']] = df_list1['String'].str.extract(r'\b(\d usa\d )\b.*?(\d{2}-\d{2}-\d{4})')

The second dataframe is something similar:

String2
3203usa34088 : Asset USA1 / asd011245
3203usa34088 : Asset USA2 / ghf023345
3203usa34088 : Asset USA3 / hgf012735
3203usa34088 : Asset USA4 / wet012455
3203usa34088 : Asset USA5 / nbj012245

And I would like to have the following:

String2	Account2
3200usa34088 : Asset USA1 / asd011245	3200usa34088
3201usa34088 : Asset USA2 / ghf023345	3201usa34088
3202usa34088 : Asset USA3 / hgf012735	3202usa34088
3203usa34088 : Asset USA4 / wet012455	3203usa34088
3204usa34088 : Asset USA5 / nbj012245	3204usa34088

CodePudding user response：

For the first dataframe we can use str.extract as follows:

df["Account"] = df["String1"].str.extract(r'(\w )\.tab\b')
df["Date"] = df["String1"].str.extract(r'\((\d{2}-\d{2}-\d{2})\)')

For the second dataframe:

df["Account2"] = df["String2"].str.extract(r'^(\w )')

CodePudding user response：

I think this can work:

# Pandas lib
import pandas as pd


# -------------------------------------------------------------- FIRST DATAFRAME

# I will suppose youre importing the df from excel ok?
df1 = pd.read_excel("First_df.xlsx")

#First case:
list_account = []
list_date = []
for string in df1['String1']:
    if "usa" in string:
        new_string = string.split()
        newnew_string = new_string[1].split(".")
        date_string = new_string[10].split("(")
        datedate_string = date_string[0].split(")")
        
        list_account.append(newnew_string[0])
        list_date.append(datedate_string[0])

df_output = pd.DataFrame({'Account': list_account})
df_output['Date'] = list_date

# -------------------------------------------------------------- SECOND DATAFRAME

df2 = pd.read_excel("Second_df.xlsx")

list_account2 = []

for string in df2['String2']:
    if "usa" in string:
        new_string = string.split()
        list_account2.append(new_string[0])
        
df_output2 = pd.DataFrame({'Account2': list_account2})

CodePudding user response：

Answer for the First use case:

l=[]
l2=[]
for i in range(len(df)):
    l.append(df.string1.tolist()[i].split(" ")[1])
    s=(df.string1.tolist()[j].split(" ")[10])
    l2.append(s[s.find("(") 1:s.find(")")])

df['Account']=l
df['Date']=l2

output :

                                         string1          Account  

    Date
0  Table 671usa50452.tab has been created as of t...  671usa50452.tab  12-19-22
1  Table 643usa50552.tab has been created as of t...  643usa50552.tab  12-19-22
2  Table 681usa50532.tab has been created as of t...  681usa50532.tab  12-19-22
3  Table 621usa56452.tab has been created as of t...  621usa56452.tab  12-19-22
4  Table 547usa67452.tab has been created as of t...  547usa67452.tab  12-19-22

for the second :

l3=[]
for i in range(len(df)):
    l.append(df.string1.tolist()[i].split(" ")[0]) 
df['Account2']=l3