Home > database >  DataFrame challenge: mapping ID to value in different row. Preferably with Polars
DataFrame challenge: mapping ID to value in different row. Preferably with Polars

Time:05-14

Consider this example:

import polars as pl

df = pl.DataFrame({
    'ID': ['0', '1', '2', '3', '4', '5','6', '7', '8', '9', '10'],
    'Name' : ['A','','','','B','','C','','','D', ''], 
    'Element' : ['', '4', '4', '0', '', '4', '', '0', '9', '', '6']
})

The 'Name' is linked to an 'ID'. This ID is used as a value in the 'Element' column. How do I map the correct 'Name' to the elements? Also I want to group the elements by 'Name' ('Name_list'), count them and sort by counted values ('E_count').

The resulting df would be:

Name_list Element E_count
-------------------------
'B'       '4'     3
'A'       '0'     2
'C'       '6'     1
'D'       '9'     1

Feedback very much appreciated; even a Pandas solution.

CodePudding user response:

If I understood your problem correctly, then you could use pandas and do the following:

countdf = pd.merge(df,df[['ID','Name']],left_on='Element',right_on='ID',how='inner')
countdf = pd.DataFrame(countdf.groupby('Name_y')['Element'].count())

result = pd.merge(countdf,df[['Name','ID']],left_on='Name_y',right_on='Name',how='left')
result[['Name','ID','Element']]

CodePudding user response:

Here's a Polars solution. We'll use a join to link the ID and Element columns (after some filtering and summarizing).

(
    df.select(["Name", "ID"])
    .filter(pl.col("Name") != "")
    .join(
        df.select(["Element"]).groupby("Element").agg(
            pl.count().alias("E_count")),
        left_on="ID",
        right_on="Element",
        how="left",
    )
    .sort('E_count', reverse=True)
    .rename({"Name":"Name_list", "ID":"Element"})
)

Note: this differs from the solution listed in your answer. The Name D is associated with ID 9 (not 10).

shape: (4, 3)
┌───────────┬─────────┬─────────┐
│ Name_list ┆ Element ┆ E_count │
│ ---       ┆ ---     ┆ ---     │
│ str       ┆ str     ┆ u32     │
╞═══════════╪═════════╪═════════╡
│ B         ┆ 4       ┆ 3       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ A         ┆ 0       ┆ 2       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ C         ┆ 6       ┆ 1       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ D         ┆ 9       ┆ 1       │
└───────────┴─────────┴─────────┘

You can also use the polars.Series.value_counts method, which looks somewhat cleaner:

(
    df.select(["Name", "ID"])
    .filter(pl.col("Name") != "")
    .join(
        df.get_column("Element").value_counts(),
        left_on="ID",
        right_on="Element",
        how="left",
    )
    .sort("counts", reverse=True)
    .rename({"Name": "Name_list", "ID": "Element", "counts": "E_count"})
)

CodePudding user response:

using pandas We can use map to map the values and using the where condition to keep from making name as null. lastly, its a groupby

df['Name'] = df['Name'].where(cond=df['Element']=="", 
                       other=df[df['Element']!=""]['Element'].map(lambda x: df[df['ID'] == x]['Name'].tolist()[0]), 
                       axis=0)
df[df['Element'] != ""].groupby(['Name','Element']).count().reset_index()
    Name    Element     ID
0   A   0   2
1   B   4   3
2   C   6   1
3   D   9   1
  • Related