I have an issue where I have multiple rows in a csv file that have to be converted to a pandas data frame but there are some rows where the columns 'name' and 'business' have multiple names and businesses that should be in separate rows and need to be split up while keeping the data from the other columns the same for each row that is split.
Here is the example data:
input:
software | name | business |
---|---|---|
abc | Andrew Johnson, Steve Martin | Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees |
xyz | Jack Jones, Rick Paul, Johnny Jones | Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001 employees |
def | Tom D., Connie J., Ricky B. | Unspecified, Unspecified, Self-employed |
output I need:
software | name | business |
---|---|---|
abc | Andrew Johnson | Outsourcing/Offshoring, 201-500 employees |
abc | Steve Martin | Health, Wellness and Fitness, 5001-10,000 employees |
xyz | Jack Jones | Banking, 1001-5000 employees |
xyz | Rick Paul | Construction, 51-200 employees |
xyz | Johnny Jones | Consumer Goods, 10,001 employees |
def | Tom D | Unspecified |
def | Connie J | Unspecified |
def | Ricky B | Self-employed |
There are additional columns similar to 'name' and 'business' that contain multiple pieces of information that need to be split up just like 'name' and 'business'. Cells that contain multiple pieces of information are in sequence (ordered).
Here's the code I have so far and creates new rows but it only splits up the contents in name column, but that leaves the business column and a few other columns left over that need to be split up along with the contents from the name column.
name2 = df.name.str.split(',', expand=True).stack()
df = df.join(pd.Series(index=name2.index.droplevel(1), data=name2.values, name = 'name2'))
dict = df.to_dict('record')
for row in dict:
new_segment = {}
new_segment['name'] = str(row['name2'])
#df['name'] = str(row['name2'])
for col,content in new_segment.items():
row[col] = content
df = pd.DataFrame.from_dict(dict)
df = df.drop('name2', 1)
Here's an alternative solution I was trying as well but it gives me an error too:
review_path = r'data/base_data'
review_files = glob.glob(review_path "/test_data.csv")
review_df_list = []
for review_file in review_files:
df = pd.read_csv(io.StringIO(review_file), sep = '\t')
print(df.head())
df["business"] = (df["business"].str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))").groupby(level=0).agg(list))
df["name"] = df["name"].str.split(r"\s*,\s*")
print(df.explode(["name", "business"]))
outPutPath = Path('data/base_data/test_data.csv')
df.to_csv(outPutPath, index=False)
Error Message for alternative solution:
Read:data/base_data/review_base.csv
Success!
Empty DataFrame
Columns: [data/base_data/test_data.csv]
Index: []
Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'business'
The above exception was the direct cause of the following exception:
Traceback (most recent call last): File "/Users/taha/Desktop/X2Analytics/project4/tryingfull.py", line 382, in df["business"] = (df["business"].str.extractall(r"(?:[\s,])(.?(?:Unspecified|employees|Self-employed))").groupby(level=0).agg(list)) File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in getitem indexer = self.columns.get_loc(key) File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc raise KeyError(key) from err KeyError: 'business'
CodePudding user response:
Try:
df["business"] = (
df["business"]
.str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))")
.groupby(level=0)
.agg(list)
)
df["name"] = df["name"].str.split(r"\s*,\s*")
print(df.explode(["name", "business"]))
Prints:
software name business
0 abc Andrew Johnson Outsourcing/Offshoring, 201-500 employees
0 abc Steve Martin Health, Wellness and Fitness, 5001-10,000 employees
1 xyz Jack Jones Banking, 1001-5000 employees
1 xyz Rick Paul Construction, 51-200 employees
1 xyz Johnny Jones Consumer Goods, 10,001 employees
2 def Tom D. Unspecified
2 def Connie J. Unspecified
2 def Ricky B. Self-employed