how to break up data in column value to multiple rows in pandas dataframe-CodePudding

I have an issue where I have multiple rows in a csv file that have to be converted to a pandas data frame but there are some rows where the columns 'name' and 'business' have multiple names and businesses that should be in separate rows and need to be split up while keeping the data from the other columns the same for each row that is split.

Here is the example data:

input:

software	name	business
abc	Andrew Johnson, Steve Martin	Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz	Jack Jones, Rick Paul, Johnny Jones	Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001 employees
def	Tom D., Connie J., Ricky B.	Unspecified, Unspecified, Self-employed

output I need:

software	name	business
abc	Andrew Johnson	Outsourcing/Offshoring, 201-500 employees
abc	Steve Martin	Health, Wellness and Fitness, 5001-10,000 employees
xyz	Jack Jones	Banking, 1001-5000 employees
xyz	Rick Paul	Construction, 51-200 employees
xyz	Johnny Jones	Consumer Goods, 10,001 employees
def	Tom D	Unspecified
def	Connie J	Unspecified
def	Ricky B	Self-employed

There are additional columns similar to 'name' and 'business' that contain multiple pieces of information that need to be split up just like 'name' and 'business'. Cells that contain multiple pieces of information are in sequence (ordered).

Here's the code I have so far and creates new rows but it only splits up the contents in name column, but that leaves the business column and a few other columns left over that need to be split up along with the contents from the name column.

name2 = df.name.str.split(',', expand=True).stack()
df = df.join(pd.Series(index=name2.index.droplevel(1), data=name2.values, name = 'name2'))

dict = df.to_dict('record')
for row in dict:
    new_segment = {}
    new_segment['name'] = str(row['name2'])
    #df['name'] = str(row['name2'])

    for col,content in new_segment.items():
            row[col] = content

df = pd.DataFrame.from_dict(dict)

df = df.drop('name2', 1)

Here's an alternative solution I was trying as well but it gives me an error too:

review_path = r'data/base_data'
review_files = glob.glob(review_path   "/test_data.csv")

review_df_list = []
for review_file in review_files:
    df = pd.read_csv(io.StringIO(review_file), sep = '\t')
    print(df.head())
    df["business"] = (df["business"].str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))").groupby(level=0).agg(list))
    df["name"] = df["name"].str.split(r"\s*,\s*")
    print(df.explode(["name", "business"]))
    outPutPath = Path('data/base_data/test_data.csv')
    df.to_csv(outPutPath, index=False)

Error Message for alternative solution:

Read:data/base_data/review_base.csv

Success!

Empty DataFrame

Columns: [data/base_data/test_data.csv]

Index: []

Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'business'

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/Users/taha/Desktop/X2Analytics/project4/tryingfull.py", line 382, in df["business"] = (df["business"].str.extractall(r"(?:[\s,])(.?(?:Unspecified|employees|Self-employed))").groupby(level=0).agg(list)) File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in getitem indexer = self.columns.get_loc(key) File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc raise KeyError(key) from err KeyError: 'business'

CodePudding user response：

Try:

df["business"] = (
    df["business"]
    .str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))")
    .groupby(level=0)
    .agg(list)
)
df["name"] = df["name"].str.split(r"\s*,\s*")

print(df.explode(["name", "business"]))

Prints:

  software            name                                             business
0      abc  Andrew Johnson            Outsourcing/Offshoring, 201-500 employees
0      abc    Steve Martin  Health, Wellness and Fitness, 5001-10,000 employees
1      xyz      Jack Jones                         Banking, 1001-5000 employees
1      xyz       Rick Paul                       Construction, 51-200 employees
1      xyz    Johnny Jones                    Consumer Goods, 10,001  employees
2      def          Tom D.                                          Unspecified
2      def       Connie J.                                          Unspecified
2      def        Ricky B.                                        Self-employed