I'm trying to create several new columns in a pandas dataframe that merge portions of multiple other columns based on year (yet another column) and column headers. The layout of the dataframe currently is:
2000_A1 | 2000_A2 | 2001_A1 | 2001_A2 | Year | Latitude | Longitude |
---|---|---|---|---|---|---|
2 | 8 | 0 | 3 | 2000 | 43.65 | -76.26 |
3 | 8 | 9 | 4 | 2000 | 43.66 | -76.26 |
3 | 2 | 5 | 3 | 2000 | 43.67 | -76.26 |
4 | 6 | 5 | 1 | 2001 | 43.68 | -76.26 |
7 | 8 | 2 | 0 | 2001 | 43.69 | -76.26 |
1 | 3 | 1 | 1 | 2001 | 43.70 | -76.26 |
I would like to create a dataframe that has two new columns (A1 and A2) that combine values based on year, as long as the year matches the column header. Originally, these were two separate dataframes that I merged (`pd.merge()`) based on coordinates (lat/long), which is why they're organized in different manners. Ideally, the new dataframe would look like this:
2000_A1 | 2000_A2 | 2001_A1 | 2001_A2 | Year | A1 | A2 | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|
2 | 8 | 0 | 3 | 2000 | 2 | 8 | 43.65 | -76.26 |
3 | 8 | 9 | 4 | 2000 | 3 | 8 | 43.66 | -76.26 |
3 | 2 | 5 | 3 | 2000 | 3 | 2 | 43.67 | -76.26 |
4 | 6 | 5 | 1 | 2001 | 5 | 1 | 43.68 | -76.26 |
7 | 8 | 2 | 0 | 2001 | 2 | 0 | 43.69 | -76.26 |
1 | 3 | 1 | 1 | 2001 | 1 | 1 | 43.70 | -76.26 |
If possible, I would like to use a for loop to create these new columns - in my actual dataset, I have about 20 years and either 7 or 9 'A's for each year, so some sort of iterative loop would be great. I'm new to python and struggling to figure out how to approach this, so any help would be much appreciated.
CodePudding user response:
You can use a combination of `.melt` and `.pivot`, optionally with `merge` to preserve your original `{year}_A{n}` columns. This will likely be much faster than an iterative solution (including anything along the lines of `.apply`).
import pandas as pd

ff = df.copy()  # keep the original wide frame for the final merge
ident = ["Year", "Latitude", "Longitude"]
# melt the {year}_A{n} columns into a "variable" column and a "value" column
df = df.melt(id_vars=ident)
# parse the parts of {year}_A{n}
parts = df.variable.str.split("_")
df["var_year"] = parts.str[0].astype(int)
df["var_kind"] = parts.str[1]
# keep only the rows whose column year matches the row's Year
df = df[df.var_year == df.Year]
# pivot the frame to get A{n} columns (keyword arguments are required in pandas 2.x)
df = df.pivot(index=ident, columns="var_kind", values="value")
df = df.reset_index()
df = pd.merge(ff, df, on=ident)  # retain original {year}_A{n}
| | 2000_A1 | 2000_A2 | 2001_A1 | 2001_A2 | Year | Latitude | Longitude | A1 | A2 |
---|---|---|---|---|---|---|---|---|---|
0 | 2 | 8 | 0 | 3 | 2000 | 43.65 | -76.26 | 2 | 8 |
1 | 3 | 8 | 9 | 4 | 2000 | 43.66 | -76.26 | 3 | 8 |
2 | 3 | 2 | 5 | 3 | 2000 | 43.67 | -76.26 | 3 | 2 |
3 | 4 | 6 | 5 | 1 | 2001 | 43.68 | -76.26 | 5 | 1 |
4 | 7 | 8 | 2 | 0 | 2001 | 43.69 | -76.26 | 2 | 0 |
5 | 1 | 3 | 1 | 1 | 2001 | 43.7 | -76.26 | 1 | 1 |
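One caveat with the steps above: `pivot` raises an error if two rows share the same (Year, Latitude, Longitude) combination. Below is a minimal sketch of a variant that keys on the row position instead, rebuilt on the sample data from the question. The function name `add_year_matched_columns`, the helper column `row_id`, and the assumption that every source column starts with a four-digit year followed by an underscore are illustrative choices, not part of the original code.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "2000_A1": [2, 3, 3, 4, 7, 1],
    "2000_A2": [8, 8, 2, 6, 8, 3],
    "2001_A1": [0, 9, 5, 5, 2, 1],
    "2001_A2": [3, 4, 3, 1, 0, 1],
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "Latitude": [43.65, 43.66, 43.67, 43.68, 43.69, 43.70],
    "Longitude": [-76.26] * 6,
})


def add_year_matched_columns(frame):
    """Return a copy of `frame` with one new column per suffix (A1, A2, ...)."""
    # Assumption: source columns look like "<4-digit year>_<suffix>"
    value_cols = frame.columns[frame.columns.str.match(r"\d{4}_")].tolist()
    # Carry the original index along as "row_id" so each row stays unique
    tall = frame.rename_axis("row_id").reset_index().melt(
        id_vars=["row_id", "Year"], value_vars=value_cols
    )
    parts = tall["variable"].str.split("_", n=1, expand=True)
    tall["var_year"] = parts[0].astype(int)
    tall["var_kind"] = parts[1]
    # Keep only the cells whose column year matches the row's Year
    tall = tall[tall["var_year"] == tall["Year"]]
    wide = tall.pivot(index="row_id", columns="var_kind", values="value")
    wide.columns.name = None
    # Join back on the index, so no merge on float coordinates is needed
    return frame.join(wide)


result = add_year_matched_columns(df)
print(result[["Year", "A1", "A2"]])
```

Because the join is on the index, rows with repeated coordinates are handled, and a year that lacks a given suffix simply leaves a missing value in that cell.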
CodePudding user response:
You could use `DataFrame.apply` with `axis=1` to process each row, reading the row's year to build the name of the column to pull the value from.
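def get_entries(line, new_entry="A1"):
    # the column is named "Year"; a row of mixed int/float columns can be
    # upcast to float, so cast back to int before building the column name
    year = int(line["Year"])
    used_feature = f"{year}_{new_entry}"
    return line[used_feature]

# axis=1 passes each row to the function as a Series
df["A1"] = df.apply(lambda line: get_entries(line, "A1"), axis=1)
df["A2"] = df.apply(lambda line: get_entries(line, "A2"), axis=1)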
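Since the question asks for a for loop over roughly 20 years and 7 to 9 suffixes, the same idea can be written as a loop over suffixes and years with a boolean mask, which avoids calling a Python function once per row. This is only a sketch on the sample data from the question: the names `out` and `kinds` are illustrative, and it assumes every source column is named `<4-digit year>_<suffix>`.

```python
import re

import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "2000_A1": [2, 3, 3, 4, 7, 1],
    "2000_A2": [8, 8, 2, 6, 8, 3],
    "2001_A1": [0, 9, 5, 5, 2, 1],
    "2001_A2": [3, 4, 3, 1, 0, 1],
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "Latitude": [43.65, 43.66, 43.67, 43.68, 43.69, 43.70],
    "Longitude": [-76.26] * 6,
})

# Collect the suffixes (A1, A2, ...) from the column names
kinds = sorted({
    match.group(1)
    for match in (re.fullmatch(r"\d{4}_(.+)", col) for col in df.columns)
    if match
})

out = df.copy()
for kind in kinds:
    # Start with missing values; the column is float so gaps can stay NaN
    out[kind] = np.nan
    for year in out["Year"].unique():
        source = f"{year}_{kind}"
        if source not in out.columns:
            continue  # this year has no column for this suffix
        mask = out["Year"] == year
        # Copy the matching year's values into the rows for that year
        out.loc[mask, kind] = out.loc[mask, source]

print(out[["Year"] + kinds])
```

Each pass through the inner loop is a single vectorized assignment, so the total work is about (number of years) x (number of suffixes) assignments regardless of how many rows the frame has.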