I came across this error at the same time as the one I discussed in my previous blog post (here), where we hit conflicting data types while dividing each value of a count column by the (approximate) number of three-month periods the data spans, to get a frequency over three months. I did show the line that fixes the error in that post, but I will go into more detail here. The code that incurs the error (without the line that fixes it) is below:

# Count rows per unique identifier; this creates a column named 'count'
All_Time_Visit_Count = data.groupBy("UniqueIdentifier").count()

# Number of distinct dates, expressed as (approximate) three-month periods
ndays = data.select("Datekey").distinct().count()
ndays = ndays / 91.2501

# x.count here is the culprit -- more on why below
Frequency = All_Time_Visit_Count.rdd.map(lambda x: x.count / ndays)

The error tells us that the two data types in the division are ‘builtin_function_or_method’ and ‘float’. But why would the value our lambda function pulls be seen as a function? It comes down to column naming. When we run the count transformation on a grouped dataset, PySpark automatically creates a column called ‘count’. As we know from the previous blog post, we have to access this column on x inside the lambda function, otherwise we would be trying to divide a ‘row’ data type; so we use x.count, since that is what the column is called. Therein lies the problem: ‘.count’ is also a built-in method on every row (we even used count earlier to create the column), and attribute lookup finds the method before it finds the column. That is why we get this error back.
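We can reproduce the collision in isolation. Below is a minimal sketch using a bare Row object (assuming PySpark is available locally); Row is a tuple subclass, so attribute lookup finds tuple's built-in count() method before falling back to the field:

from pyspark.sql import Row

r = Row(count=5)
print(r.count)       # <built-in method count of Row object ...>
print(r["count"])    # 5 -- indexing by field name still reaches the value
r.count / 91.2501    # TypeError: unsupported operand type(s) for /:
                     # 'builtin_function_or_method' and 'float'

The fix is easy, though, and only requires us to rename the column once it has been created: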

All_Time_Visit_Count = All_Time_Visit_Count.withColumnRenamed("count", "Frequency")

Then reference ‘Frequency’ in your lambda function instead of ‘count’ and you won't see this error again.
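For completeness, here is a sketch of the full corrected pipeline, assuming the same data DataFrame with the UniqueIdentifier and Datekey columns used above:

# Count visits per identifier; the aggregation creates a 'count' column
All_Time_Visit_Count = data.groupBy("UniqueIdentifier").count()

# Rename 'count' so attribute access no longer collides with the tuple method
All_Time_Visit_Count = All_Time_Visit_Count.withColumnRenamed("count", "Frequency")

# Distinct dates in the data, expressed as (approximate) three-month periods
ndays = data.select("Datekey").distinct().count()
ndays = ndays / 91.2501

# x.Frequency now resolves to the column value, so the division succeeds
Frequency = All_Time_Visit_Count.rdd.map(lambda x: x.Frequency / ndays)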
