I am having a hard time transforming unstructured data from an Excel sheet into a structured tabular format in Python so I can do data analysis on it.
I want to copy each location (e.g. Paris) into Col 4 of its corresponding rows. Each row below a "# of x" header line represents one person, and the number of rows varies: for example, Milan has 3 entries while London has 4. Additionally, I want to remove the locations with no entries, like Paris and Rome.
I kind of have an idea but don't know how to implement it: if Col 2 is neither "# of Cafes to Visit" nor numeric, copy that value into Col 4 until the next such value is found... not sure about this, though :(
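Roughly, the loop I have in mind (untested, just to show the idea; it assumes the sheet is already in a DataFrame df):

current_location = None
for i, row in df.iterrows():
    value = row['Col 2']
    # anything that is neither the header text nor a number should be a location
    if value != '# of Cafes to Visit' and not str(value).isnumeric():
        current_location = value
    else:
        df.at[i, 'Col 4'] = current_location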
Can someone help me?
Input:
Col 1 | Col 2 | Col 3 | Col 4 |
---|---|---|---|
Location | Paris | | |
# of Shops To Visit | # of Cafes to Visit | # of Museums to Visit | |
Location | Milan | | |
# of Shops To Visit | # of Cafes to Visit | # of Museums to Visit | |
3 | 5 | 3 | |
2 | 4 | 4 | |
5 | 6 | 7 | |
Location | London | | |
# of Shops To Visit | # of Cafes to Visit | # of Museums to Visit | |
6 | 6 | 2 | |
3 | 5 | 0 | |
5 | 4 | 1 | |
5 | 4 | 1 | |
Location | Rome | | |
# of Shops To Visit | # of Cafes to Visit | # of Museums to Visit | |
Output:
Col 1 | Col 2 | Col 3 | Col 4 |
---|---|---|---|
3 | 5 | 3 | Milan |
2 | 4 | 4 | Milan |
5 | 6 | 7 | Milan |
6 | 6 | 2 | London |
3 | 5 | 0 | London |
5 | 4 | 1 | London |
5 | 4 | 1 | London |
Answer:
Try:
import numpy as np
import pandas as pd

# Tag each row with its location: blank out headers and numbers in Col 2
# so only the location names remain, then forward-fill them downwards.
df['dummy'] = df['Col 2'].replace('# of Cafes to Visit|[0-9]+', np.nan, regex=True).ffill()
# Locations with no entries (Paris, Rome) contribute only their "Location"
# and header rows, i.e. exactly 2 rows - keep only groups with more than 2.
df = df.groupby('dummy').filter(lambda x: len(x) > 2).drop(columns=['dummy'])

# Move the locations to Col 4 and forward-fill them onto the data rows.
df['Col 4'] = df['Col 4'].fillna(df['Col 2'])
df['Col 4'] = df['Col 4'].replace('# of Cafes to Visit|[0-9]+', np.nan, regex=True)
df['Col 4'] = df['Col 4'].ffill()

# Drop the "Location" and header rows, then clear any leftover non-numeric text.
df = df[~df['Col 1'].isin(['# of Shops To Visit', 'Location'])]
df[['Col 1', 'Col 2', 'Col 3']] = df[['Col 1', 'Col 2', 'Col 3']].replace(r'\D+', np.nan, regex=True)
df.reset_index(drop=True, inplace=True)
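This assumes the sheet is already loaded into df with columns Col 1 to Col 4. A minimal loading sketch (the file name and layout here are assumptions; adjust them to your sheet):

import numpy as np
import pandas as pd

# Read the raw sheet with no header row; this assumes three populated
# columns, with Col 4 created empty as the target for the locations.
df = pd.read_excel('trips.xlsx', header=None)
df.columns = ['Col 1', 'Col 2', 'Col 3']
df['Col 4'] = np.nan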
Output:
Col 1 Col 2 Col 3 Col 4
0 3 5 3 Milan
1 2 4 4 Milan
2 5 6 7 Milan
3 6 6 2 London
4 3 5 0 London
5 5 4 1 London
6 5 4 1 London
Answer:
To approach this problem, I first gathered the cities individually:
import numpy as np
import pandas as pd

# Assumes the table was saved as a plain whitespace-delimited text file at `path`
# (sep=r'\s+' is the modern spelling of the deprecated delim_whitespace=True).
data = pd.read_csv(path, sep=r'\s+', header=None, names=['col1', 'col2', 'col3'])
cities = data[data['col1'] == 'Location']['col2'].reset_index(drop=True)
I then figured out which city each row belongs to by counting how many times 'col3' is null: each 'Location' row has an empty 'col3', so a cumulative sum of the nulls gives a running 0-based city index per row:
city_inds = np.cumsum(data['col3'].isna()) - 1
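On a toy series the trick looks like this (illustrative values only, reusing the pd/np imports above):

col3 = pd.Series([np.nan, 'x', np.nan, '3', '4'])  # NaN marks the start of a block
np.cumsum(col3.isna()) - 1                         # -> 0, 0, 1, 1, 1

Each row gets the 0-based index of the block (city) it falls under.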
Finally, you can use these indices into the cities series to allocate the correct city to each row of the original dataframe, and then drop the rows that are irrelevant:
data['cities'] = cities.iloc[city_inds].reset_index(drop=True)
data = data[data['col1'].str.isnumeric()].reset_index(drop=True)  # drop rows whose col1 isn't numeric
Out[]:
  col1 col2 col3  cities
0 3 5 3 Milan
1 2 4 4 Milan
2 5 6 7 Milan
3 6 6 2 London
4 3 5 0 London
5 5 4 1 London
6 5 4 1 London
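One small follow-up, since the goal is analysis: the values parsed this way are still strings, so you would likely cast them to integers at the end (a sketch, assuming the columns above):

data[['col1', 'col2', 'col3']] = data[['col1', 'col2', 'col3']].astype(int)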