I am having a hard time transforming unstructured data from an Excel sheet into a structured tabular format in Python so I can do data analysis on it.
I want to copy each location (e.g. Paris) into Col 4 of its corresponding rows. Each row below a "# of x" header line represents one person, and the number of rows varies: for example, Milan has 3 entries while London has 4. Additionally, I want to remove the locations with no entries, like Paris and Rome.
I kind of have an idea but don't know how to implement it: if Col 2 is neither "# of Cafes to Visit" nor numeric, copy that value into Col 4 until the next such value is found... not sure about this, though :(
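Roughly, the loop I have in mind (untested, just to show the idea; it assumes the sheet is already in a DataFrame df):

current_location = None
for i, row in df.iterrows():
    value = row['Col 2']
    # anything that is neither the header text nor a number should be a location
    if value != '# of Cafes to Visit' and not str(value).isnumeric():
        current_location = value
    else:
        df.at[i, 'Col 4'] = current_location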
Can someone help me?
Input:
Col 1 | Col 2 | Col 3 | Col 4 |
---|---|---|---|
Location | Paris | | |
# of Shops To Visit | # of Cafes to Visit | # of Museums to Visit | |
Location | Milan | | |
# of Shops To Visit | # of Cafes to Visit | # of Museums to Visit | |
3 | 5 | 3 | |
2 | 4 | 4 | |
5 | 6 | 7 | |
Location | London | | |
# of Shops To Visit | # of Cafes to Visit | # of Museums to Visit | |
6 | 6 | 2 | |
3 | 5 | 0 | |
5 | 4 | 1 | |
5 | 4 | 1 | |
Location | Rome | | |
# of Shops To Visit | # of Cafes to Visit | # of Museums to Visit | |
Output:
Col 1 | Col 2 | Col 3 | Col 4 |
---|---|---|---|
3 | 5 | 3 | Milan |
2 | 4 | 4 | Milan |
5 | 6 | 7 | Milan |
6 | 6 | 2 | London |
3 | 5 | 0 | London |
5 | 4 | 1 | London |
5 | 4 | 1 | London |
Answer:
Try:
import numpy as np
import pandas as pd

# Tag each row with its location: blank out headers and numbers in Col 2
# so only the location names remain, then forward-fill them downwards.
df['dummy'] = df['Col 2'].replace('# of Cafes to Visit|[0-9]+', np.nan, regex=True).ffill()
# Locations with no entries (Paris, Rome) contribute only their "Location"
# and header rows, i.e. exactly 2 rows - keep only groups with more than 2.
df = df.groupby('dummy').filter(lambda x: len(x) > 2).drop(columns=['dummy'])

# Move the locations to Col 4 and forward-fill them onto the data rows.
df['Col 4'] = df['Col 4'].fillna(df['Col 2'])
df['Col 4'] = df['Col 4'].replace('# of Cafes to Visit|[0-9]+', np.nan, regex=True)
df['Col 4'] = df['Col 4'].ffill()

# Drop the "Location" and header rows, then clear any leftover non-numeric text.
df = df[~df['Col 1'].isin(['# of Shops To Visit', 'Location'])]
df[['Col 1', 'Col 2', 'Col 3']] = df[['Col 1', 'Col 2', 'Col 3']].replace(r'\D+', np.nan, regex=True)
df.reset_index(drop=True, inplace=True)
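This assumes the sheet is already loaded into df with columns Col 1 to Col 4. A minimal loading sketch (the file name and layout here are assumptions; adjust them to your sheet):

import numpy as np
import pandas as pd

# Read the raw sheet with no header row; this assumes three populated
# columns, with Col 4 created empty as the target for the locations.
df = pd.read_excel('trips.xlsx', header=None)
df.columns = ['Col 1', 'Col 2', 'Col 3']
df['Col 4'] = np.nan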
Output:
Col 1 Col 2 Col 3 Col 4
0 3 5 3 Milan
1 2 4 4 Milan
2 5 6 7 Milan
3 6 6 2 London
4 3 5 0 London
5 5 4 1 London
6 5 4 1 London
Answer:
To approach this problem, I first gathered the cities individually:
import numpy as np
import pandas as pd

# Assumes the table was saved as a plain whitespace-delimited text file at `path`
# (sep=r'\s+' is the modern spelling of the deprecated delim_whitespace=True).
data = pd.read_csv(path, sep=r'\s+', header=None, names=['col1', 'col2', 'col3'])
cities = data[data['col1'] == 'Location']['col2'].reset_index(drop=True)
I then figured out which city each row belongs to by counting how many times 'col3' is null: each 'Location' row has an empty 'col3', so a cumulative sum of the nulls gives a running 0-based city index per row:
city_inds = np.cumsum(data['col3'].isna()) - 1
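On a toy series the trick looks like this (illustrative values only, reusing the pd/np imports above):

col3 = pd.Series([np.nan, 'x', np.nan, '3', '4'])  # NaN marks the start of a block
np.cumsum(col3.isna()) - 1                         # -> 0, 0, 1, 1, 1

Each row gets the 0-based index of the block (city) it falls under.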
Finally, you can use these indices into the cities series to allocate the correct city to each row of the original dataframe, and then drop the rows that are irrelevant:
data['cities'] = cities.iloc[city_inds].reset_index(drop=True)
data = data[data['col1'].str.isnumeric()].reset_index(drop=True)  # drop rows whose col1 isn't numeric
Out[]:
  col1 col2 col3  cities
0 3 5 3 Milan
1 2 4 4 Milan
2 5 6 7 Milan
3 6 6 2 London
4 3 5 0 London
5 5 4 1 London
6 5 4 1 London
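One small follow-up, since the goal is analysis: the values parsed this way are still strings, so you would likely cast them to integers at the end (a sketch, assuming the columns above):

data[['col1', 'col2', 'col3']] = data[['col1', 'col2', 'col3']].astype(int)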