Home > OS >  How to split a a string data into the same column / single column - without affecting the dataset it
How to split a a string data into the same column / single column - without affecting the dataset it

Time:12-31

The data set I'm working on has got a column with zipcodes in it. Some entries only have one zipcode, some have 2, 5, or 10 zipcodes. Like this:

Zipcode(s)
1245
5863, 5682, 1995
6978, 1123, 5659, 34554
4539, 6453

I want to do some simple analysis -- apply a value_counts() on the column to see what zipcodes are the most popular. But I can't properly do it since most cells have multiple zipcodes on them. That's also the reason why I want a way where it won't affect the dataset itself, just that specific instance where all zipcodes are split and are in one column.

I've tried splitting them into multiple columns with .str.split(',',n=20, expand=True) but that's not really what I'm looking for. I want them all split into a single column.

CodePudding user response:

I think pandas.DataFrame.explode is what you're looking for.
With this, you take all values from lists (which you created with the split function) to a new row.

import pandas as pd

df = pd.DataFrame({
    "Zipcodes":["8000", "2000, 2002, 3003", "8000, 2002", "3004, 2004, 3003"]
})

df

(
    df.Zipcodes
    .str.replace(" ", "") # optional, if you don't need this then 
    .str.split(",")       # use ", " instead of ","
    .explode()
    .value_counts()
)

Output:

8000    2
2002    2
3003    2
2000    1
3004    1
2004    1

CodePudding user response:

You can use this python snippet below:

import pandas as pd
df = pd.DataFrame({
    "Zipcode(s)" : ["1245", "5863, 5682, 1995", "6978, 1123, 5659, 34554", "4539, 6453"]
})
df["Zipcode(s)"] = df["Zipcode(s)"].map(lambda zcode: zcode.split(", "))
zipcodes = sum(df["Zipcode(s)"].to_list(), [])
#create dummy(empty) dataframe
dummydf = pd.DataFrame({"Zipcode(s)" : zipcodes})
print(dummydf["Zipcode(s)"].value_counts())

Output:

1245     1
5863     1
5682     1
1995     1
6978     1
1123     1
5659     1
34554    1
4539     1
6453     1
Name: Zipcode(s), dtype: int64
  • Related