Reassigning unique column values to easier names-CodePudding

I am parsing a larger csv that looks loosely like so:

time   id        angle
0.0   1_2_3       ...
0.0   ad_42       ...
0.0   34_02_03    ...
0.1   1_2_3       ...
0.1   ad_42       ...
0.1   f_1         ...
....

As you can see, the id field has a lot of variety in the naming schematic, but definitely has values that repeat. My goal is to read in the csv and reassign id values as they appear while tracking the ones in common. So it would be nice to write into the dataframe and have output like so:

time   id      angle
0.0   id1       ...
0.0   id2       ...
0.0   id3       ...
0.1   id1       ...
0.1   id2       ...
0.1   id4       ...
....

Where the ids correspond but have a more human-readable form (ie 1-x).

Any advice would be greatly appreciated.

CodePudding user response：

You can do:

ids = df['id'].unique().tolist()
id_dict = {ids[i-1]:'id' str(i) for i in range(1,len(ids) 1)}
df['id'] = df['id'].map(id_dict)

The ids gives you the unique id values and to each unique id you assign a id number as in id_dict. Then map the dict onto your column to get the new values.

And note that you don't need to worry about the order of the values: unique() - preserves the order of the values in which they appear.

For chunks:

If you are dealing with chunks, you may do as follows - just need to take care to identify not available keys and the indices to append with 'id':

id_dict = {}
dict_size = 0
for chunk in chunks:
    chunk_na = chunk.loc[~chunk['id'].isin(id_dict)]
    if len(chunk_na) > 0:
        na_ids = chunk_na['id'].unique().tolist()
        id_dict.update({na_ids[i-1-dict_size]:'id' str(i) for i in range(dict_size 1,len(na_ids) dict_size 1)}
        dict_size  = len(id_dict)

    chunk['id'] = chunk['id'].map(id_dict)

CodePudding user response：

One approach would be to hash the id column with the hash_pandas_object() function. Since the hash is reproducible each time, you would have a consistently formatted ID that remains the same over time.

Here's a StackOverflow answer: Get the same hash value for a Pandas DataFrame each time

CodePudding user response：

Try using pd.factorize()

df['id'] = ['id{}'.format(i) for i in (pd.factorize(df['id'])[0] 1)]