i have this series which contains country,state,city and i would like to extract them accordingly- refer to the output table
Region |
---|
US* |
Arizona** |
Phoenix |
Mesa |
California** |
Los Angeles |
San Diego |
Sacramento |
Florida** |
Tampa |
Miami |
Canada* |
Central Canada** |
Montreal |
London |
my desired output
Region | State | City | |
---|---|---|---|
US* | Arizona** | Phoenix | |
US* | Arizona** | Mesa | |
US* | California** | Los Angeles | |
US* | California** | San Diego | |
US* | California** | Sacramento | |
US* | Florida** | Tampa | |
US* | Florida** | Miami | |
Canada* | Central Canada** | Montreal | |
Canada* | Central Canada** | London |
is this even possible?
I tried some panda operations with isin() but failed miserably
CodePudding user response:
of course it's possible:
def split_by_country(region_list: pd.Series):
result = []
start_idx = None
for i, region in enumerate(region_list):
if region.endswith("*") and not region.endswith("**"):
if start_idx is None:
start_idx = i
elif isinstance(start_idx, int):
result.append(region_list[start_idx: i])
start_idx = i
result.append(region_list[start_idx:])
return result
countries = split_by_country(regions_s)
countries
Above code will splits the series/list of regions to list of lists. Every sublist starts (index 0) with country name. Then u can do something like that:
country_dict = {country[0]: split_by_region(country[1:])
for country in countries}
split_by_region
is the same as split_by_country by with different condition (region.endswith("*") and not region.endswith("**")
> region.endswith("**")
)
and at the end to (belowe code i write without checking, so it may contains some syntax error) :
result_df = pd.DataFrame(columns=["country","subregion","city"])
for i, (country, subregions) in enumerate(country_dict.iteritems()):
for subregion, city in subregions.iteritems():
result_df.loc[i] = [country, subregion, city]