- The first dataframe I have is the following:
String1 |
---|
Table 671usa50452.tab has been created as of the process date (12-19-22). |
Table 643usa50552.tab has been created as of the process date (12-19-22). |
Table 681usa50532.tab has been created as of the process date (12-19-22). |
Table 621usa56452.tab has been created as of the process date (12-19-22). |
Table 547usa67452.tab has been created as of the process date (12-19-22). |
I would like to extract all the accounts that contain 'usa' in between and the date specified in each row to have something like this:
String1 | Account | Date |
---|---|---|
Table 671usa50452.tab has been created as of the process date (12-19-22). | 671usa50452 | 12-19-22 |
Table 643usa50552.tab has been created as of the process date (12-19-22). | 643usa50552 | 12-19-22 |
Table 681usa50532.tab has been created as of the process date (12-19-22). | 681usa50532 | 12-19-22 |
Table 621usa56452.tab has been created as of the process date (12-19-22). | 621usa56452 | 12-19-22 |
Table 547usa67452.tab has been created as of the process date (12-19-22). | 547usa67452 | 12-19-22 |
I've been trying to use the following, but the information is not getting into the columns of my new dataframe:
df_list1[['Account', 'Date']] = df_list1['String'].str.extract(r'\b(\d usa\d )\b.*?(\d{2}-\d{2}-\d{4})')
- The second dataframe is something similar:
String2 |
---|
3203usa34088 : Asset USA1 / asd011245 |
3203usa34088 : Asset USA2 / ghf023345 |
3203usa34088 : Asset USA3 / hgf012735 |
3203usa34088 : Asset USA4 / wet012455 |
3203usa34088 : Asset USA5 / nbj012245 |
And I would like to have the following:
String2 | Account2 |
---|---|
3200usa34088 : Asset USA1 / asd011245 | 3200usa34088 |
3201usa34088 : Asset USA2 / ghf023345 | 3201usa34088 |
3202usa34088 : Asset USA3 / hgf012735 | 3202usa34088 |
3203usa34088 : Asset USA4 / wet012455 | 3203usa34088 |
3204usa34088 : Asset USA5 / nbj012245 | 3204usa34088 |
CodePudding user response:
For the first dataframe we can use str.extract
as follows:
df["Account"] = df["String1"].str.extract(r'(\w )\.tab\b')
df["Date"] = df["String1"].str.extract(r'\((\d{2}-\d{2}-\d{2})\)')
For the second dataframe:
df["Account2"] = df["String2"].str.extract(r'^(\w )')
CodePudding user response:
I think this can work:
# Pandas lib
import pandas as pd
# -------------------------------------------------------------- FIRST DATAFRAME
# I will suppose youre importing the df from excel ok?
df1 = pd.read_excel("First_df.xlsx")
#First case:
list_account = []
list_date = []
for string in df1['String1']:
if "usa" in string:
new_string = string.split()
newnew_string = new_string[1].split(".")
date_string = new_string[10].split("(")
datedate_string = date_string[0].split(")")
list_account.append(newnew_string[0])
list_date.append(datedate_string[0])
df_output = pd.DataFrame({'Account': list_account})
df_output['Date'] = list_date
# -------------------------------------------------------------- SECOND DATAFRAME
df2 = pd.read_excel("Second_df.xlsx")
list_account2 = []
for string in df2['String2']:
if "usa" in string:
new_string = string.split()
list_account2.append(new_string[0])
df_output2 = pd.DataFrame({'Account2': list_account2})
CodePudding user response:
Answer for the First use case:
l=[]
l2=[]
for i in range(len(df)):
l.append(df.string1.tolist()[i].split(" ")[1])
s=(df.string1.tolist()[j].split(" ")[10])
l2.append(s[s.find("(") 1:s.find(")")])
df['Account']=l
df['Date']=l2
output :
string1 Account
Date
0 Table 671usa50452.tab has been created as of t... 671usa50452.tab 12-19-22
1 Table 643usa50552.tab has been created as of t... 643usa50552.tab 12-19-22
2 Table 681usa50532.tab has been created as of t... 681usa50532.tab 12-19-22
3 Table 621usa56452.tab has been created as of t... 621usa56452.tab 12-19-22
4 Table 547usa67452.tab has been created as of t... 547usa67452.tab 12-19-22
for the second :
l3=[]
for i in range(len(df)):
l.append(df.string1.tolist()[i].split(" ")[0])
df['Account2']=l3