I imported a local csv file into my spark scala project. I have to calculate some statistics from the data. One of the calculations involve grouping the data by the date field and counting the occurrences by id. The second calculation involves showing the top 100 names in the data that appear frequently.
Any help would be appreciated on these items.
For the first one, I am able to create the "Month" column, but null values are there instead of just the numeric month so that I can group by the new "Month" column. Can someone help with this?
TestData_csv.withColumn("Month",date_format(to_date(col("date")), "MM")).show(false)
CodePudding user response:
You were almost there!
The problem in your case was that the to_date
function you were using did not understand what format the strings were written in. There is also a to_date
function that accepts a fmt
string to make Spark understand how to parse your strings.
That would looks something like this:
import spark.implicits._
// This is just the data you posted in a comment on your question
val df = spark.read.option("sep", ";").option("header", "true").csv("./dateTimeCSV.csv")
df.show
--- ---------- --------- --------- -----
| Id| Date| Fname| Lname|Route|
--- ---------- --------- --------- -----
| 1| 1/1/2017| Greg| Rollo| S|
| 2| 1/2/2017| Frankie| Allen| N|
| 2| 1/3/2017| Denise| Smith| E|
| 2| 2/11/2017| Earlene| Lane| W|
| 2| 3/22/2017| Donovan| Joseph| SE|
| 3| 3/11/2017| Rebecca| Bowls| NE|
| 3| 4/5/2017|Katherine| Cook| NW|
| 4| 5/16/2017| Alicia| Mason| SW|
| 4| 6/9/2017| Bob| Peterson| S|
| 4| 6/30/2017| Janet| Love| N|
| 5| 7/2/2017| Richard| Dingle| E|
| 5| 8/25/2017| Thomas|Velasquez| W|
| 5| 8/10/2017| Susan| King| SE|
| 1| 9/25/2017| Pratesh| Venkat| NE|
| 1|10/14/2017| Neha| Kumar| NW|
| 1| 11/2/2017| Louis| Williams| SW|
| 3|11/28/2017| Winston| McLean| E|
| 3| 12/2/2017| Clark| Kent| N|
| 3|12/14/2017| Bruce| Wayne| S|
--- ---------- --------- --------- -----
val output = df.withColumn("Month",date_format(to_date(df("Date"), "MM/dd/yyyy"), "MM"))
output.show
--- ---------- --------- --------- ----- -----
| Id| Date| Fname| Lname|Route|Month|
--- ---------- --------- --------- ----- -----
| 1| 1/1/2017| Greg| Rollo| S| 01|
| 2| 1/2/2017| Frankie| Allen| N| 01|
| 2| 1/3/2017| Denise| Smith| E| 01|
| 2| 2/11/2017| Earlene| Lane| W| 02|
| 2| 3/22/2017| Donovan| Joseph| SE| 03|
| 3| 3/11/2017| Rebecca| Bowls| NE| 03|
| 3| 4/5/2017|Katherine| Cook| NW| 04|
| 4| 5/16/2017| Alicia| Mason| SW| 05|
| 4| 6/9/2017| Bob| Peterson| S| 06|
| 4| 6/30/2017| Janet| Love| N| 06|
| 5| 7/2/2017| Richard| Dingle| E| 07|
| 5| 8/25/2017| Thomas|Velasquez| W| 08|
| 5| 8/10/2017| Susan| King| SE| 08|
| 1| 9/25/2017| Pratesh| Venkat| NE| 09|
| 1|10/14/2017| Neha| Kumar| NW| 10|
| 1| 11/2/2017| Louis| Williams| SW| 11|
| 3|11/28/2017| Winston| McLean| E| 11|
| 3| 12/2/2017| Clark| Kent| N| 12|
| 3|12/14/2017| Bruce| Wayne| S| 12|
--- ---------- --------- --------- ----- -----
As you can see, I added a fmt
string with as value "MM/dd/yyyy"
to make Spark understand how to parse the date.
Hope this helps!