There are many things you can do to improve the performance of your grid. Before spending money on additional hardware (CPUs, memory, faster disks, or more nodes), there are a couple of things you need to check.
Before we proceed, it is important to emphasize that, in general, the Grid is only responsible for:
- Scheduling of Jobs and tracking their execution status
- Distribution of Jobs to nodes
If your jobs are not running fast enough, there are two possibilities:
- Your job code has not been sufficiently optimized, or there is a genuine performance issue in your job code.
- There is a bottleneck in your Grid infrastructure.
The following is a quick checklist for the latter, using the graphics and grid architecture from Applied Algo (https://appliedalgo.com) as an example.
When troubleshooting performance issues in general, start by asking yourself these two questions:
- Where is the bottleneck?
- What is the nature of the problem? Is the CPU maxing out? Memory? Disk (thrashing, or paging a lot; see http://www.programmerinterview.com/index.php/operating-systems/how-virtual-memory-works)? Is your job code moving too much data across the different tiers?
Are multiple jobs referencing the same data on the same external data source? You may be better off running them sequentially, one after another, so they do not compete for the same resource.
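As a rough illustration of running jobs that share a data source one after another, here is a minimal Python sketch. It is not how any particular grid scheduler works. The job tuples `(name, source, func)` and the functions `group_by_source`, `run_group` and `run_jobs` are hypothetical names made up for this example. Jobs that read from different sources still run in parallel; only jobs hitting the same source are serialized.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_source(jobs):
    """Group hypothetical (name, source, func) job tuples by data source."""
    groups = defaultdict(list)
    for name, source, func in jobs:
        groups[source].append((name, func))
    return groups


def run_group(group):
    """Run the jobs of one group strictly one after another."""
    return [(name, func()) for name, func in group]


def run_jobs(jobs):
    """Run groups in parallel, but jobs sharing a source sequentially."""
    groups = group_by_source(jobs)
    results = {}
    # One worker per distinct data source (at least one worker overall).
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        futures = {
            source: pool.submit(run_group, group)
            for source, group in groups.items()
        }
        for source, future in futures.items():
            results[source] = future.result()
    return results
```

With this layout, two jobs reading `db1` never overlap, while a job reading `db2` is free to run alongside them.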
Input Data – Bottleneck in Database Tier?
- Partitioning Strategy: Are jobs referencing data residing in the same table or database? Partition your data across multiple tables, data files on separate disks, or multiple database instances/SQL clusters.
- SQL Optimization: Review the query execution plan; add primary keys, indexes, and foreign keys to optimize joins.
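To make the "review the query execution plan" step concrete, here is a small sketch using Python's built-in `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`. SQLite is used only because it needs no setup; your database tier will have its own plan viewer and its own plan format. The `trades` table, the `idx_trades_symbol` index and the `query_plan` helper are invented for this example.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (id INTEGER PRIMARY KEY, symbol TEXT, qty INTEGER)"
)
conn.executemany(
    "INSERT INTO trades (symbol, qty) VALUES (?, ?)",
    [("ABC", 10), ("XYZ", 20), ("ABC", 5)],
)

sql = "SELECT qty FROM trades WHERE symbol = ?"

# Without an index on symbol, the plan reports a full table scan.
plan_before = query_plan(conn, sql, ("ABC",))

# After adding an index, the plan reports an index search instead.
conn.execute("CREATE INDEX idx_trades_symbol ON trades (symbol)")
plan_after = query_plan(conn, sql, ("ABC",))

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the shift from a scan to a search that uses the index is the thing to look for.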
Grid Load Balancer
- Node Affinity: Run fast jobs in one node group, slow jobs in another.
- Throttling settings? If your node is already working too hard (for example, it has a long disk queue), pushing it further by queuing up jobs it cannot start won’t help.
- Is the actual job running on the nodes optimized?
- Node proximity to input data: is there a slow link? Does your grid reside in the Cloud? What about your data source? Are you sending too much data over the wire?
- Perform preliminary operations (filtering and simple aggregations, for example) on the input data in SQL, BEFORE fetching it into the nodes, to minimize traffic and optimize consumption.
- Excessive Logging?
- Excessive threading won’t help; you’d just get a lot of context switches. Try limiting the number of threads to the number of CPUs.
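The last point, capping threads at the CPU count, could be sketched in Python roughly as follows. This is an illustration under assumptions, not a recommendation for any specific grid product: `worker_count` and `run_all` are hypothetical helper names, and the tasks are assumed to be plain zero-argument callables.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def worker_count():
    """One worker per CPU reported by the OS, and never fewer than one."""
    return os.cpu_count() or 1


def run_all(tasks, max_workers=None):
    """Run zero-argument callables on a bounded pool; results keep task order."""
    workers = max_workers or worker_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


if __name__ == "__main__":
    squares = run_all([lambda i=i: i * i for i in range(5)])
    print(worker_count(), squares)
```

Extra tasks beyond the cap simply wait in the pool's queue rather than spawning more threads. In standard CPython, CPU-bound work may need processes rather than threads to use several cores, but the same cap applies.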