I have a pyspark dataframe that looks like this:
df = spark.sql("""
SELECT list
FROM categories
""")
df.show()
list |
---|
1,1,1,2,2,apple |
apple,orange,1,2 |
How can I generate a field that counts the elements separated by commas? The ideal dataframe looks like this:
list | count |
---|---|
1,1,1,2,2,apple | 6 |
apple,orange,1,2 | 4 |
I would also like to create a frequency count of the individual elements, which looks like this:
list | frequency_count |
---|---|
1 | 4 |
2 | 3 |
apple | 2 |
orange | 1 |
CodePudding user response:
The element count is straightforward with the DataFrame API; for the frequency count I find the DataFrame API laborious, so I use SQL for that part. The implementation is as follows.
from pyspark.sql import functions as F

count_df = df.withColumn('count', F.size(F.split('list', ',')))  # number of comma-separated elements per row
count_df.show(truncate=False)
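With the two sample rows from the question, count_df.show(truncate=False) should print something like:
+----------------+-----+
|list            |count|
+----------------+-----+
|1,1,1,2,2,apple |6    |
|apple,orange,1,2|4    |
+----------------+-----+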
df.createOrReplaceTempView('tmp')
freq_sql = """
select list, count(*) count
from (
    -- split each row into an array, collect all arrays, flatten them
    -- into one array, then explode it into one row per element
    select explode(flatten(collect_list(split(list, ',')))) list
    from tmp
)
group by list
"""
freq_df = spark.sql(freq_sql)
freq_df.show(truncate=False)
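For reference, the same frequency count can also be written with the DataFrame API; this is only a sketch, assuming the same df with a single list column:
from pyspark.sql import functions as F

freq_df_api = (
    df.select(F.explode(F.split('list', ',')).alias('list'))  # one row per element
      .groupBy('list')
      .count()
      .withColumnRenamed('count', 'frequency_count')
)
freq_df_api.show(truncate=False)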
CodePudding user response:
You actually have two questions. To get the count of items in a comma-separated string you can do it like this (using pandas):
import pandas as pd
df = pd.DataFrame(data=["1,1,1,2,2,apple", "apple,orange,1,2"], columns=["list"])
df["count"] = df["list"].apply(lambda x: len(x.split(",")))
print(df)
which gives as output:
list count
0 1,1,1,2,2,apple 6
1 apple,orange,1,2 4
I use a lambda function here that splits each string on ',' (returning a list of the separate items) and then counts the items with len(). The apply method applies this lambda function to each item in the column. A vectorized alternative without the lambda is shown just below.
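The same count can also be computed with pandas' string accessor, which avoids the explicit lambda (assuming the same df as above):
df["count"] = df["list"].str.split(",").str.len()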
For the second question you want to count how often each unique element occurs across the whole column. You can first use the split method to split the strings into separate columns:
df_items = df["list"].str.split(",", expand=True)
print(df_items)
which gives you a dataframe like this:
0 1 2 3 4 5
0 1 1 1 2 2 apple
1 apple orange 1 2 None None
Then, since the position of each item does not matter, you can glue all the columns together into one single column using the melt method. melt puts all the items in a column called value, and value_counts gives the count of each unique item. The script looks like this:
frequency_count = pd.melt(df_items)["value"].value_counts()    # count of each unique element
frequency_count = frequency_count.to_frame("frequency_count")  # name the count column explicitly
frequency_count.index.name = "list"
which gives you the output:
frequency_count
list
1 4
2 3
apple 2
orange 1
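If your pandas version supports Series.explode (0.25 or newer), the frequency count can also be written more compactly; a sketch using the same df as above:
frequency_count = (
    df["list"].str.split(",")   # lists of elements per row
      .explode()                # one row per element
      .value_counts()           # count per unique element
      .to_frame("frequency_count")
)
frequency_count.index.name = "list"
print(frequency_count)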