How my lack of knowledge of mocking in Scala Spark applications affected code structure
I was working with Scala Spark applications recently. I had to write a unit test for a function, and because I didn’t know much about mocking table reads in Spark, I removed all the spark.read calls from the function instead. When I looked at the function to be tested, I noticed that it had become less cohesive.

Here is the version of the function without spark.read calls:
def processThirdPartyEvents(entities: DataFrame, subEntities: DataFrame, entityEvents: DataFrame, thirdPartyEvents: DataFrame)
And here is how it is used:
// Resolve the entity-event table from config and load that day's events.
val entityEventTable = applicationConfig.getString("entity_event_table")
val entityEvents = getEntityEvents(entityEventTable, jobConfig.year, jobConfig.month, jobConfig.day).cache()

// Resolve the third-party-event table and load a one-day window of events.
val thirdPartyEventTable = applicationConfig.getString("third_party_event_table")
val startDate = LocalDate.of(jobConfig.year.toInt, jobConfig.month.toInt, jobConfig.day.toInt)
val endDate = startDate.plusDays(1)
val thirdPartyEvents = getThirdPartyEvents(thirdPartyEventTable, startDate, endDate)

processThirdPartyEvents(entities, subEntities, entityEvents, thirdPartyEvents)
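For context, a helper like getEntityEvents is presumably just a thin wrapper around a table read. Here is a hypothetical sketch; the partition columns and the implicit SparkSession are my assumptions:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical shape of the helper: read the table and keep one day's partition.
def getEntityEvents(table: String, year: String, month: String, day: String)
                   (implicit spark: SparkSession): DataFrame =
  spark.read.table(table)
    .filter(col("year") === year && col("month") === month && col("day") === day)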
What I don’t like about this declaration is that I have to provide entityEvents and thirdPartyEvents as arguments in order to test processThirdPartyEvents. I also don’t like that the caller needs to know about those events. Had I known more about mocking spark.read, I would have called spark.read directly inside processThirdPartyEvents instead of requiring those events as parameters. I didn’t want to make this function public solely for testing purposes; I wanted to hide it and test it as a black box.
I much prefer this declaration because it hides which events are used, so the caller doesn’t need to be concerned with them:
def processThirdPartyEvents(entities: DataFrame, subEntities: DataFrame)
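Concretely, the body would then just absorb the plumbing the caller used to do. A rough sketch, reusing applicationConfig, jobConfig, and the helpers from above; the join keys and the final processing are made up for illustration:

import java.time.LocalDate
import org.apache.spark.sql.DataFrame

def processThirdPartyEvents(entities: DataFrame, subEntities: DataFrame): DataFrame = {
  // The reads that used to burden the caller now live here.
  val entityEventTable = applicationConfig.getString("entity_event_table")
  val entityEvents =
    getEntityEvents(entityEventTable, jobConfig.year, jobConfig.month, jobConfig.day).cache()

  val thirdPartyEventTable = applicationConfig.getString("third_party_event_table")
  val startDate = LocalDate.of(jobConfig.year.toInt, jobConfig.month.toInt, jobConfig.day.toInt)
  val endDate = startDate.plusDays(1)
  val thirdPartyEvents = getThirdPartyEvents(thirdPartyEventTable, startDate, endDate)

  // The caller no longer sees the event tables at all.
  entities
    .join(subEntities, Seq("entity_id"))
    .join(entityEvents, Seq("entity_id"))
    .join(thirdPartyEvents, Seq("entity_id"))
}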
So my advice is: learn more about the tools you use!