# What is the need of Parallel Processing for Machine Learning in Real Time ?

Before starting about the **Parallel Processing** we will try to put some light on **Machine Learning** definition.

Machine Learning

Again going with the layman’s terms **Machine learning **is the subset of** Artificial Intelligence **provides us statistical tool to explore data** .
**Machine learning (ML) is the application of artificial intelligence (AI) through a family of algorithms that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

In Machine learning -:

- The system is able to make prediction or decision based on past data . - Needs only small amount of data . - Works well with low end computers .

Python is the language of reference for ML and many libraries provide off-the-shelf machine learning algorithms, like scikit-learn, XGBoost, LightGBM or TensorFlow.

Parallel Processing :-

Parallel Processing simply means algorithms are deployed across the multiple processors .Usually this means distributed processing , a typical ML algorithm involves doing a lot of computation (work/tasks) on a lot of data set . Traditionally (whatever that means in this context), machine learning has been executed in single processor environments, where algorithmic bottlenecks can lead to substantial delays in model processing, from training to classification to distance and error calculation and beyond.

In order to achieve economy of performance. The lack of parallel processing or parallel execution on a shared memory architecture involves doing a lot of utilization(work/tasks) on a lot of data set after performing independent tasks . So in my opinion parallel programming is hard. It really is.

From the above statement we conclude that Parallel Processing is not magic and cannot be used in each and every situation . There are both practical and theoretical algorithmic design issues that must be considered when even thinking about incorporating parallel processing into a project. However, with Big Data encompassing such large amounts of data, sets of which are increasingly being relied upon for routine machine learning tasks, the trouble associated with parallelism may very well be worth it in a given situation due solely to the potential of dramatic time-savings related to algorithm execution.

PARALLELIZING COMPUTATIONS TASKS

Tasks that have NO DEPENDENCIES are trivially parallelized e.g - k-folds of CV, HyperParameter Tuning (with Grid Search) - Predicting many rows in parallel (once Trained Model is available in memory).

Then, a more difficult part is of parallelizing the various Tasks that happen WITHIN EACH ALGORITHM e.g.

this requires getting under the hood of each algorithm’s estimation process. in some algos, this is still easy e.g. in Random Forest, can train its N trees in parallel. In GBM (that may appear sequential), the repetitive task of ‘which variable to now split on’ can still be parallelized But for other algos, this many not be straightforward e.g. derivative-based iterative algorithms like SGD or Optimization.

PARALLELIZING FOR DISTRIBUTED DATA

- Big data used in Training is commonly held in MapReduce Frameworks (sits across many machines) - MapReduce works best if very small data moves across these machines very few times. But most ML Algos have ITERATE-TILL-CONVERGENCE design e.g. at each iteration, derivative-based methods need to calculate gradients over WHOLE Training data! This requires moving Gradients calculated on each Machine locally to a Single machine to sum up and get Total Gradient. Multiplied by the large number of iterations, this quickly becomes the new bottleneck e.g. weight updates in GBM, GLM, SGD, kMeans, EM, LowRank Models, Neural nets etc. similar bottleneck arises in case of Tree-based algos when splitting criterion needs to be calculated on Whole Training Data over & over again.

MANAGING BOTH PARALLELISM AT THE SAME TIME CAN BE TOUGH.

General Purpose Computing on Graphics Processing Units:-

General-purpose computing on graphics processing units (GPGPU, rarely GPGP) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).

https://en.m.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units

In short GPU — only graphics related operations. GPGPU — graphics + your own programming unit for anything that can be parallelized.

Modifying algorithms to allow certain of their tasks to take advantage of GPU parallelization can demonstrate noteworthy gains in both task performance and completion speed.

The GPGPU paradigm fits into Flynn’s taxonomy as single program, multiple data (SPMD) architecture, which differs from the traditional multicore CPU computing paradigm.

Machine learning algorithms could also see performance gains by parallelizing common tasks which may be shared among numerous algorithms, such as performing matrix multiplication, which is used by several classification, regression, and clustering techniques, including, of particular interest, linear regression.

CUDA Parallel Programming Framework :-

CUDA is a parallel computing platform and programming model that is developed by Nvidia offers its way of task parallel and data parallel computing models to give many options for a problem to be solved. You can push different jobs to same GPU concurrently or compute a one data parallel job with using all its resources, to finish it much quicker than a CPU does.

CUDA enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelizable part of the computation.

CUDA is a part of it, just like OpenCL and multithreading such as AMD, the combination of CUDA and Nvidia GPUs dominates several application areas, including deep learning, and is a foundation for some of the fastest computers in the world. CUDA and Nvidia GPUs have been adopted in many areas that need high floating-point computing performance, as summarized pictorially in the image above. A more comprehensive list includes:

**- Computational finance.
- Climate, weather, and ocean modeling.
- Data science and analytics.
- Deep learning and machine learning.
- Manufacturing/AEC (Architecture, Engineering, and Construction): CAD and CAE (including computational fluid dynamics, computational structural mechanics, design and visualization, and electronic design automation)
- Media and entertainment (including animation, modeling, and rendering; color correction and grain management; compositing; finishing and effects; editing; encoding and digital distribution; on-air graphics; on-set, review, and stereo tools; and weather graphics)
- Medical imaging
- Research: Higher education and supercomputing (including computational chemistry and biology, numerical analytics, physics, and scientific visualization)**

**
**There are yet other types of parallelism (for instance distributed memory), and those can not be programmed in CUDA. Moreover, CUDA is tied to NVidia, admittedly a large vendor of graphics hardware, but not the only one. Thus, CUDA is important, but it’s also proprietary and of limited applicability.
however, the memory management and device computation techniques, while they will be, by necessity, quite different by algorithmic situation, generalize to our various tasks: identify parallelizable computations, allocate device memory, copy data to device, perform parallelized computation, copy result back to host, continue with serial code. Note that the memory allocation and data copying overhead can easily become a bottleneck here, with any potential computational time savings stymied by these processes.

Algorithmic Applications in Machine Learning :-

If given the proper data, knowledge of algorithm implementation, and ambition, there is no limit to what you can attempt with **parallel processing** in machine learning. Of course, and as mentioned above, identifying parallelizable portions of code is the most difficult task, and there may not be any in a given algorithm.

A good place to start is **matrix multiplication**, as treated above, which is a well-used method for implementing linear regression algorithms. An implementation of linear regression on GPGPU can be found here. The paper “Performance Improvement of Data Mining in Weka through GPU Acceleration” notes speed increases, and the paper provides some additional insight into conceptualizing parallelism algorithmically.

Another common task used in machine learning which is ripe for parallelization is **distance calculation**. Euclidean distance is a very common metric which requires calculation over and over again in numerous algorithms, including *k*-means clustering. Since the individual distance calculations of successive iterations are not dependent on other calculations of the same iteration, these calculations could be performed in parallel (if we forget our memory management overhead as a potential bottleneck to contend with).

While these aforementioned shared statistical tasks could benefit from efficiency of execution, there is an additional aspect of the machine learning data flow which could potentially allow for even more significant gains. A common evaluation technique regularly employed in machine learning model validating is *k*-fold cross-validation, involving the intensive, not-necessarily sequential processing of dataset segments. **k****-fold cross-validation **is a deterministic method for model building, achieved by leaving out one of *k*segments, or folds, of a dataset, training on all *k*-1 segments, and using the remaining *k*th segment for testing; this process is then repeated *k* times, with the individual prediction error results being combined and averaged in a single, integrated model. This provides variability, with the goal of producing the most accurate predictive models possible.

This extensive model validating, when performed sequentially, can be relatively time-consuming, especially when each fold is paired with a computationally expensive algorithm task such as linear regression matrix multiplication. As *k*-fold cross-validation is a standard method for predicting a given machine learning algorithm’s error rate, attempting to increase the speed by which this prediction occurs seems particularly worthy of effort. A very high level view of doing so is implied previously in this article.

A consideration for those using Python goes beyond algorithm design, and relates to optimized native code and runtime comparisons with parallel implementations. While beyond the scope of this discussion, you may want to read more about this topic here.

Thinking algorithmically is necessary to leverage finite computational resources in any situation, and is no different with machine learning. With some clever thinking, an in-depth understanding of what you are attempting, and a collection of tools and their documentation, you never know what you may be able to achieve. Trust me when I say that, after doing some related work in grad school, finding opportunities to experiment with take much less time than you might think. Familiarize yourself with a code base, read a few tutorials, and get to work.

Recap

What I have presented here are the insights of What is the need of Parallel Processing for Machine Learning in Real Time. I hope you learned something today.

Always remember that solid business questions, clean and well-distributed data always beat fancy models.

Feel free to leave a message if you have any feedback, and share with anyone that might find this useful.