Pandas Performance Optimization: Vectorization

Posted on 2020-07-08

Recently, while working on a data analysis project, I needed to traverse a DataFrame. Because the dataset was large, each traversal took a long time, so I started looking for ways to improve performance and came across vectorization in Pandas. This article explores how to apply it and compares its performance with other traversal methods.

What is vectorization?

Vectorization means operating on an entire array at once, as opposed to operating on each individual value (scalar) in turn. It suits operations that apply uniformly to all the data in a column. The simplest example is a professor giving every student a 5-point curve on an exam. Instead of thinking in rows, iterating over each student's grade and revising it, we think in columns and apply the add-5 operation to the whole column; Pandas then applies it to every cell automatically. In practice, Pandas' vectorized operations are implemented in C under the hood, which brings a significant efficiency improvement.

Vectorization works for Pandas' DataFrame and Series objects, and also for NumPy's ndarray object.
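As a small illustration of that point, the sketch below applies the same "add 5" operation to an ndarray, a Series and a DataFrame. The sample values and the column names midterm and final are made up for this example.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration only.
arr = np.array([70, 80, 90])
ser = pd.Series([70, 80, 90])
df = pd.DataFrame({"midterm": [70, 80, 90], "final": [60, 75, 85]})

# The same expression works on all three containers;
# the addition is applied to every element without an explicit loop.
arr_curved = arr + 5
ser_curved = ser + 5
df_curved = df + 5

print(arr_curved)
print(ser_curved.tolist())
print(df_curved)
```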

Vectorization operations: a first try

In this part, we put the exam-curve example mentioned above to an actual test. First, we generate a DataFrame with two columns, student ID and the corresponding grade, containing 1000 rows in total.
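The original figure is not reproduced here, so the following is only a sketch of how such a DataFrame could be built. The column name student_id, the random seed and the grade range are assumptions; only the grade column name and the row count come from the text.

```python
import numpy as np
import pandas as pd

# Assumed setup: random integer grades, fixed seed so the example is repeatable.
rng = np.random.default_rng(seed=0)

grade_df = pd.DataFrame({
    "student_id": np.arange(1, 1001),         # IDs 1..1000 (hypothetical column name)
    "grade": rng.integers(0, 96, size=1000),  # grades 0..95, leaving room for a 5-point curve
})

print(grade_df.head())
print(grade_df.shape)
```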

How do we vectorize? Pandas already provides solid vectorization support for commonly used operations such as sum, mean, variance and other common statistics (see the official Pandas documentation for the full list of built-in functions that support vectorization). In most simple scenarios, all we need to do is treat a whole column as if it were a single value, and Pandas will automatically apply the operation to each cell. For example:

grade_df["grade"] += 5

Comparing the output before and after, you can see that the grade column has changed. We used Jupyter Notebook to time the operation above.
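The timing screenshot is not available, so here is a rough way to check both claims outside a notebook. It uses the standard timeit module in place of Jupyter's %timeit magic; the test data is the assumed setup described earlier, and the numbers will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 1000 random grades (student_id is a hypothetical column name).
rng = np.random.default_rng(seed=0)
grade_df = pd.DataFrame({
    "student_id": np.arange(1, 1001),
    "grade": rng.integers(0, 96, size=1000),
})

before = grade_df["grade"].copy()
grade_df["grade"] += 5  # one vectorized operation over the whole column

# Every grade moved by exactly 5 points.
print((grade_df["grade"] - before).unique())

# Time the addition on the untouched copy so repeated runs do not keep
# inflating the grades.
runs = 1000
elapsed = timeit.timeit(lambda: before + 5, number=runs)
print(f"vectorized add: {elapsed / runs * 1e6:.1f} microseconds per run")
```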

Performance comparison

Next, for comparison, we perform the same operation with traditional traversal, using the .iterrows() method and the .apply() method in turn.

.iterrows() method:
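The original snippet for this step was an image. A minimal sketch of what a row-by-row version of the 5-point curve might look like, on the assumed test data, is:

```python
import numpy as np
import pandas as pd

# Assumed test data: 1000 random grades (student_id is a hypothetical column name).
rng = np.random.default_rng(seed=0)
grade_df = pd.DataFrame({
    "student_id": np.arange(1, 1001),
    "grade": rng.integers(0, 96, size=1000),
})
original = grade_df["grade"].copy()

# iterrows() yields (index, row) pairs; a new Series is built for every row,
# which is the main reason this approach is slow.
curved = []
for _, row in grade_df.iterrows():
    curved.append(row["grade"] + 5)

# Write the results back in one step instead of mutating the frame mid-loop.
grade_df["grade"] = curved
```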

.apply() method:
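Again as a sketch standing in for the missing screenshot, the same curve written with .apply() and a lambda could look like this:

```python
import numpy as np
import pandas as pd

# Assumed test data: 1000 random grades (student_id is a hypothetical column name).
rng = np.random.default_rng(seed=0)
grade_df = pd.DataFrame({
    "student_id": np.arange(1, 1001),
    "grade": rng.integers(0, 96, size=1000),
})
original = grade_df["grade"].copy()

# apply() calls the lambda once per element of the column.
# It avoids building a Series per row, but it is still a Python-level loop.
grade_df["grade"] = grade_df["grade"].apply(lambda g: g + 5)
```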

In these measurements, the vectorized operation was 37% faster than .apply(), the faster of the two traversal methods, and an order of magnitude faster than .iterrows().

Vectorization operations: Custom vectorization functions

Please note that not all functions support vectorization. The official Pandas documentation does not give a clear specification either; it only mentions that such functions "accept NumPy arrays and return another array or value". The case covered in this article is for reference only, and specific application scenarios need further study.

Although Pandas has many commonly used vectorized functions built in, they are not enough to cover slightly more complex requirements. Let us extend the example above: once the professor has applied the curve, each student's score on the 100-point scale needs to be converted to a letter grade. Readers familiar with the .apply() method will think of writing a conditional function and invoking it through a lambda.
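The picture of this function is missing, so the version below is a reconstruction rather than the original code. The 90/80/70/60 thresholds are taken from the vectorized function later in this article; the sample grades and the letter column name are made up.

```python
import pandas as pd


def letter_convert(x):
    """Map a single numeric score to a letter grade."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"


# Made-up sample grades, one per letter.
grade_df = pd.DataFrame({"grade": [95, 85, 75, 65, 55]})

# apply() passes one scalar at a time, so the if statements work as usual.
grade_df["letter"] = grade_df["grade"].apply(lambda x: letter_convert(x))
print(grade_df)
```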

Can we call the function the same way in a vectorized operation? As discussed earlier, the idea is to pass in a whole column as if it were a single value and let Pandas handle the rest. So let us try exactly that.

We passed the entire grade column as an argument to the letter_convert function, but the call raised an error:

"ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."

The error message shows that Pandas did not vectorize anything: the entire Series object was passed to letter_convert as a single argument, and an if statement cannot evaluate the truth value of a whole Series. The methods suggested in the message, such as a.any() and a.all(), only help when a single answer is needed for all elements together, for example whether at least one student got an A, or whether everyone passed the exam. They do not apply here, because we want a letter grade for each individual element.

In other words, an explicit if statement cannot be used on a whole Series: Python evaluates the condition once for the entire object, not once per element.

But for this type of function, which uses if to inspect a cell value and return a new value, there is another way: the numpy.where() function.
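To make this concrete, the sketch below reproduces the failure on a few made-up grades and then shows what any() and all() are actually for. The letter_convert function is the reconstruction assumed in this article, not necessarily the original code.

```python
import pandas as pd


def letter_convert(x):
    """Scalar version: works for one score, not for a whole Series."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"


grades = pd.Series([95, 72, 58, 88])  # made-up sample grades

# Passing the whole Series makes "if x >= 90" test a Boolean Series,
# which has no single truth value, so pandas raises ValueError.
try:
    letter_convert(grades)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# any() and all() collapse the Boolean Series into one answer for the group.
any_a = bool((grades >= 90).any())       # did at least one student get an A?
all_passed = bool((grades >= 60).all())  # did every student pass?
print(any_a, all_passed)  # True False
```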

The numpy.where() function

According to the official NumPy documentation, the signature is numpy.where(condition[, x, y]). It takes a Boolean array as the condition and, element by element, picks the value from x where the condition is true and from y otherwise, returning the result as a new ndarray.

In this case, we can rewrite the letter_convert function in the following form:
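A minimal illustration of a single call, using made-up scores and pass/fail labels:

```python
import numpy as np

scores = np.array([95, 72, 58, 88])  # made-up sample scores

# The comparison produces a Boolean array; np.where then picks "pass"
# where it is True and "fail" where it is False.
passed = np.where(scores >= 60, "pass", "fail")

print(passed)  # ['pass' 'pass' 'fail' 'pass']
print(scores)  # the input array is left unchanged
```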

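import numpy as np

def vectorized_letter_convert(x):
    # Start from the lowest threshold; each later call overrides the
    # earlier result wherever its higher threshold is met.
    r = np.where(x >= 60, "D", "F")
    r = np.where(x >= 70, "C", r)
    r = np.where(x >= 80, "B", r)
    r = np.where(x >= 90, "A", r)
    return r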

Note: In this example and in other cases involving numeric comparisons, the conditions may need to appear in the reverse order of the original if-elif-else chain, as in the code above. A sequence of where calls behaves like a sequence of independent if statements rather than an if-elif-else chain: every where call is evaluated on every cell, so a later call overwrites the result of an earlier one wherever its condition holds.

This time the call works as intended.
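The effect of the ordering is easy to see on a handful of made-up scores. The sketch below chains the where calls once from the lowest threshold upward and once in the same top-down order as an if-elif-else chain:

```python
import numpy as np

x = np.array([95, 85, 75, 65, 55])  # made-up scores, one per letter grade

# Lowest threshold first: higher grades overwrite lower ones.
r = np.where(x >= 60, "D", "F")
r = np.where(x >= 70, "C", r)
r = np.where(x >= 80, "B", r)
r = np.where(x >= 90, "A", r)
ascending = r

# Same order as the if-elif-else chain: the last call (>= 60) is also true
# for every higher score, so it overwrites all earlier results with "D".
w = np.where(x >= 90, "A", "F")
w = np.where(x >= 80, "B", w)
w = np.where(x >= 70, "C", w)
w = np.where(x >= 60, "D", w)
descending = w

print(ascending)   # ['A' 'B' 'C' 'D' 'F']
print(descending)  # ['D' 'D' 'D' 'D' 'F']
```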
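The screenshot of the result is missing; a sketch of the call on a few made-up grades follows. The letter column name is an assumption.

```python
import numpy as np
import pandas as pd


def vectorized_letter_convert(x):
    # Lowest threshold first; later calls override earlier results.
    r = np.where(x >= 60, "D", "F")
    r = np.where(x >= 70, "C", r)
    r = np.where(x >= 80, "B", r)
    r = np.where(x >= 90, "A", r)
    return r


# Made-up sample grades.
grade_df = pd.DataFrame({"grade": [95, 85, 75, 65, 55]})

# The whole column goes in as one argument; an ndarray of letters comes back
# and is stored as a new column.
grade_df["letter"] = vectorized_letter_convert(grade_df["grade"])
print(grade_df)
```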

We increased the data volume to 10,000 rows for the performance analysis:

.iterrows() method:

.apply() method:

Vectorization operations:
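The timing screenshots for the three methods are not available. The sketch below shows one way to rerun the comparison with the standard timeit module on 10,000 assumed random grades. It prints no reference numbers because results depend on the machine and library versions; the helper names with_iterrows, with_apply and with_where are made up for this example.

```python
import timeit

import numpy as np
import pandas as pd


def letter_convert(x):
    """Scalar conversion used by the two traversal methods."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"


def vectorized_letter_convert(x):
    """Column-at-a-time conversion built from chained np.where calls."""
    r = np.where(x >= 60, "D", "F")
    r = np.where(x >= 70, "C", r)
    r = np.where(x >= 80, "B", r)
    r = np.where(x >= 90, "A", r)
    return r


# Assumed test data: 10,000 random grades between 0 and 100.
rng = np.random.default_rng(seed=0)
grade_df = pd.DataFrame({
    "student_id": np.arange(1, 10001),
    "grade": rng.integers(0, 101, size=10000),
})


def with_iterrows():
    return [letter_convert(row["grade"]) for _, row in grade_df.iterrows()]


def with_apply():
    return grade_df["grade"].apply(letter_convert).tolist()


def with_where():
    return vectorized_letter_convert(grade_df["grade"]).tolist()


# Fewer repetitions for the slow methods keeps the total run time short.
for name, func, runs in [
    ("iterrows", with_iterrows, 1),
    ("apply", with_apply, 10),
    ("np.where", with_where, 100),
]:
    per_call = timeit.timeit(func, number=runs) / runs
    print(f"{name:>9}: {per_call * 1e3:.3f} ms per call")
```

All three helpers return the same list of letters, so the comparison measures only how the work is done, not what is computed.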

This shows the considerable performance gains that vectorization brings. When traversing a DataFrame, try to avoid the iterrows method; prefer the apply method or vectorization. Vectorizing more complex functions is something I am still exploring, and I expect it to be widely useful in the future.

Reference Documentation:

  1. Pandas official documentation
  2. NumPy official documentation
  3. A Beginner’s Guide to Optimizing Pandas Code for Speed by Sofia Heisler