I created this dataframe with PySpark from a txt file that contains search queries and user IDs.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true") \
    .option("delimiter", "\t") \
    .option("inferSchema", "true") \
    .csv("/content/drive/MyDrive/my_data.txt")
df.select("AnonID", "Query").show()
```
And it looks like this:
+------+--------------------+
|AnonID|               Query|
+------+--------------------+
| 142| rentdirect.com|
| 142|www.prescriptionf...|
| 142| staple.com|
| 142| staple.com|
| 142|www.newyorklawyer...|
| 142|www.newyorklawyer...|
| 142| westchester.gov|
| 142| space.comhttp|
| 142| dfdf|
| 142| dfdf|
| 142| vaniqa.comh|
| 142| www.collegeucla.edu|
| 142| www.elaorg|
| 142| 207 ad2d 530|
| 142| 207 ad2d 530|
| 142| broadway.vera.org|
| 142| broadway.vera.org|
| 142| vera.org|
| 142| broadway.vera.org|
| 142| frankmellace.com|
| 142| ucs.ljx.com|
| 142| attornyleslie.com|
| 142|merit release app...|
| 142| www.bonsai.wbff.org|
| 142| loislaw.com|
| 142| rapny.com|
| 142| whitepages.com|
| 217| lottery|
| 217| lottery|
| 217| ameriprise.com|
| 217| susheme|
| 217| united.com|
| 217| mizuno.com|
| 217|p; .; p;' p; ' ;'...|
| 217|p; .; p;' p; ' ;'...|
| 217|asiansexygoddess.com|
| 217| buddylis|
| 217|bestasiancompany.com|
| 217| lottery|
| 217| lottery|
| 217| ask.com|
| 217| weather.com|
| 217| wellsfargo.com|
| 217|www.tabiecummings...|
| 217| wanttickets.com|
| 217| yahoo.com|
| 217| -|
| 217| www.ngo-quen.org|
| 217| -|
| 217| vietnam|
+------+--------------------+
What I want is for each user ID to become a single row, with its queries spread across numbered columns:
+----+------+------+---------
|  ID|     1|     2| 3 ......
+----+------+------+---------
| 142|query1|query2|query3
| 217|query1|query2|query3
| 993|query1|query2|query3
|1268|query1|query2|query3
|1326|query1|query2|query3
.
.
.
I tried to swap the rows and columns based on what I found searching on Google, but I didn't succeed.
CodePudding user response:
You can group the dataframe by `AnonID`, and then pivot the `Query` column to create a new column for each unique query:
```python
from pyspark.sql import functions as F

df = df.groupBy("AnonID").pivot("Query").agg(F.first("Query"))
```
Note that this names each column after the query itself. If you have a lot of distinct values, collect the queries into a single array column instead:
```python
df = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Queries"))
```
After the pivot, you can rename the query columns to 1, 2, 3, etc. (the backticks are needed because the query strings contain dots):

```python
query_cols = [c for c in df.columns if c != "AnonID"]
df = df.select(
    "AnonID",
    *[F.col(f"`{c}`").alias(str(i + 1)) for i, c in enumerate(query_cols)]
)
```
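To see the target shape itself without a Spark session, here is the same reshaping as a minimal plain-Python sketch — group each user's queries in order, then number them. The sample rows are made up for illustration:

```python
from collections import defaultdict

# (AnonID, Query) pairs in log order -- sample data for illustration
rows = [
    (142, "rentdirect.com"),
    (142, "staple.com"),
    (217, "lottery"),
    (217, "ameriprise.com"),
    (217, "united.com"),
]

# Group queries per user, preserving their original order
queries_by_id = defaultdict(list)
for anon_id, query in rows:
    queries_by_id[anon_id].append(query)

# One row per user, with one numbered "column" ("1", "2", ...) per query
wide = {
    anon_id: {str(i + 1): q for i, q in enumerate(qs)}
    for anon_id, qs in queries_by_id.items()
}
print(wide[142])  # {'1': 'rentdirect.com', '2': 'staple.com'}
```

This is exactly what the `collect_list` route produces in Spark, just before the list is split out into separate columns.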