Here's the first in what will be an ad hoc series of short blog posts, each a paragraph or so on the solution to a problem I've come across while using PySpark.

In this post I will discuss using the map() function to apply a function to every value in an RDD, and the error message I got when I did:

TypeError: unsupported operand type(s) for /: 'Row' and 'float'

We see this error because we are trying to perform an operation on two incompatible data types; in this case, dividing a Row by a float. When I came across it, I was trying to divide each visit count by the number of three-month periods in the data (the number of distinct days divided by 91.2501, approximately the number of days in 3 months) to get a frequency value per 3 months. Code below:

All_Time_Visit_Count = data.groupby("UniqueIdentifier").count()
# Renaming is essential: a column named 'count' is read as a function inside the lambda and breaks it with a separate error later on. I will be doing a further blog post on this.
All_Time_Visit_Count = All_Time_Visit_Count.withColumnRenamed("count", "Frequency")
ndays = data.select("Datekey").distinct().count()
ndays = ndays/91.2501
Frequency = All_Time_Visit_Count.rdd.map(lambda x: x/ndays)
If you are familiar with PySpark, you will know that, because of lazy evaluation, running the above code does not output the error: so far we have only defined transformations. The code only actually runs, and the error only appears, once you perform an action, such as converting the result to a DataFrame. The error is not in the action, though, but in the transformations earlier on. This is essential to keep in mind!
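The delayed failure can be reproduced without a Spark cluster. The sketch below is plain Python, not PySpark: a namedtuple called Row stands in for pyspark.sql.Row (which is also a tuple subclass), Python's built-in lazy map stands in for the RDD transformation, and list() plays the role of the action. The sample rows and the value of ndays are made up for illustration.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (an assumption, for illustration only).
Row = namedtuple("Row", ["UniqueIdentifier", "Frequency"])
rows = [Row("a", 30), Row("b", 10)]  # made-up sample data
ndays = 2.5  # hypothetical float divisor

# The built-in map is lazy, so nothing is computed and nothing fails here.
frequency = map(lambda x: x / ndays, rows)

# Consuming the result plays the role of a Spark action; the error surfaces now.
try:
    list(frequency)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The line that defines the map runs cleanly; the TypeError about dividing a Row by a float is only raised once the results are actually requested.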


Regardless, the problem here is that we are trying to divide a value of the Row data type by our float variable ndays. This comes down to what the lambda function receives as 'x'. Even though we may think x is the value in the Frequency column, what map() actually passes in is the whole row that contains it. To fix this we must pull the actual value out of the row, like so:
Frequency = All_Time_Visit_Count.rdd.map(lambda x: x.Frequency/ndays)
This will now work.
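As a quick check of the idea behind the fix, here is a minimal plain-Python sketch. It is not PySpark: a namedtuple called Row stands in for pyspark.sql.Row, and the sample rows and the value of ndays are made up. The only point it illustrates is that accessing the field by name gives a number that can be divided, whereas the row itself cannot.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (an assumption, for illustration only).
Row = namedtuple("Row", ["UniqueIdentifier", "Frequency"])
rows = [Row("a", 30), Row("b", 10)]  # made-up sample data
ndays = 2.5  # hypothetical float divisor

# Pull the Frequency value out of each row before dividing.
frequency = list(map(lambda x: x.Frequency / ndays, rows))

# Keeping the identifier alongside the result is often more useful.
frequency_by_id = [(x.UniqueIdentifier, x.Frequency / ndays) for x in rows]

print(frequency)
print(frequency_by_id)
```

In real PySpark the same lambda would be passed to the RDD's map(), as in the line above.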
