Big Data, Small Data, Grid Computing, Cloud Computing, Distributed Computation vs Distributed Persistence

Hadoop (http://hadoop.apache.org) has gained a lot of popularity in recent years – and has claimed the throne in Grid Computing. There has been a lot of confusion about what’s meant by Grid, Load Balancing, Big Data, Cloud, etc. There are grids geared towards persisting, parsing and analyzing non-relational data (social media, web scraping, etc.) – Hadoop is one such example. There are software vendors that cater to simple Enterprise workflow (i.e. scheduling, job chaining) – BMC Control-M and Schedulix, for instance (here’s a decent survey: http://www.softpanorama.org/Admin/job_schedulers.shtml). There are also data platforms geared towards Numerical and Quantitative Analysis on data in relational format – Applied Algo ETL Suite, for example. How do we decide what’s suitable for what purpose?

What is Big Data?

We will start with this easy one – Big Data is just Big Data. Wikipedia says Big Data is simply a lot of data: “Big data is high volume, high velocity, and/or high variety information assets … Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” (http://en.wikipedia.org/wiki/Big_data)

That’s not very helpful, and we need a better answer than that, don’t we? Big Data is…

a. Not Terabytes of data, but Petabytes of data

b. A Data Platform which allows efficient storage (and retrieval) of such amounts of data. Typically, this means Distributed Storage.

c. A Grid Computing Platform – which allows parallel processing against “Big Data” (or “Small Data”).

Find out whether your data size is in the range of hundreds of GBs, TBs, or really in the PB range. Ask questions: Where is my data coming from? Am I purging and archiving away old, obsolete entries from my database?

There are many “Big Data” articles merely trying to establish a use case for “Big Data” technologies where they’re not needed. For example: http://www.syoncloud.com/case_study

First off, 50GB is not “Big Data”. Secondly, in my humble experience at brokerages, hedge funds and investment banks – nobody has “Big Data” (and Hadoop), except edge cases such as exchanges, or firms mining the Internet.

The so-called “Traditional Approach” in the article – intra-application feeds via csv/xml/Excel/txt – is a tried-and-true approach still practiced in many of the biggest investment banks. Support personnel love it, because in the event of an issue they can track down the problem simply by opening the feed file. Further, there are shops where market data, positions and risks are communicated among the different components in the application chain by way of a Message Bus (Tibco Rendezvous, for instance). There are smaller infrastructures where different components talk over Web Services or sockets.

There are also cases where a convincing use case is presented – imagine mining companies collecting seismic data sets over a wide geographical area (many probes sending in periodic updates):

http://www.teradata.com/articles/Data-Analytics-for-Oil-and-Gas-The-Business-Case/?utm_content=buffer3b455&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Unless you’re Big Brother, or your business is Social Media, News/Media mining, retail at the scale of Amazon, credit cards, security audits (scans and surveillance can generate a lot of data), real-time analysis of data streams from mobile and wearable devices (ask why you’re persisting the data), IoT (Internet of Things), etc. – in short, a firm dealing with public, unstructured data – don’t rush off to a Big Data implementation.

“Don’t use Hadoop – your data isn’t that big” by Chris Stucchio:

http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

What is Load Balancing?

If you look around, “Load Balancing” is typically used in the context of “Network Load Balancing” – which isn’t “Grid Computing”: http://en.wikipedia.org/wiki/Load_balancing_(computing)

What is a Grid?

1. Distributed Computation – that is, calculation/computation can be distributed across machines in a server farm. Distribution algorithms: server status (CPU/memory/disk activity), round robin, node affinity, etc. – and proximity to input data.
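
As an illustration (a vendor-neutral sketch, not any particular product’s algorithm), here is what a dispatcher combining two of those strategies might look like in Java – prefer the least-loaded node, fall back to round robin when everything is busy. The Node class and the load threshold are hypothetical:

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical node descriptor: in a real grid, cpuLoad would come
// from a heartbeat/monitoring agent running on each machine.
class Node {
    final String host;
    volatile double cpuLoad; // 0.0 (idle) to 1.0 (saturated)
    Node(String host, double cpuLoad) { this.host = host; this.cpuLoad = cpuLoad; }
}

class Dispatcher {
    private final List<Node> nodes;
    private final AtomicInteger rrIndex = new AtomicInteger();

    Dispatcher(List<Node> nodes) { this.nodes = nodes; }

    // Prefer the least-loaded node; if all nodes are busy, fall back to round robin.
    Node pickNode() {
        Node leastLoaded = nodes.stream()
                .min(Comparator.comparingDouble(n -> n.cpuLoad))
                .orElseThrow(IllegalStateException::new);
        if (leastLoaded.cpuLoad < 0.8) { // arbitrary "busy" threshold for illustration
            return leastLoaded;
        }
        return nodes.get(Math.floorMod(rrIndex.getAndIncrement(), nodes.size()));
    }
}
```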

Hadoop’s MapReduce algorithm (excellent illustration here: http://ayende.com/blog/4435/map-reduce-a-visual-explanation) is closely tied to “Distributed Persistence”: slaves (Task Trackers) running on multiple machines, fetching input data from multiple Data Nodes, execute concurrently towards a single goal – i.e. stateful distribution of computation load for incremental analytics. This is a particularly good fit for Big Data analysis: Social Media (for example Facebook), news mining (for example Reuters).
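
For the flavor of it, here is the classic word-count example sketched against the standard Hadoop MapReduce Java API – the mapper emits (word, 1) pairs, the reducer sums them per word; job-setup boilerplate is omitted:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs in parallel on Task Trackers, ideally on the
// DataNodes that already hold the input split (data locality).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }
}

// Reduce phase: receives all counts for a given word and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}
```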

Stateless Distribution, on the contrary, distributes load without consideration of a node’s proximity to the input data.

2. Distributed Persistence – (a) input data can be fetched from multiple sources residing on multiple machines, (b) output data can be persisted on multiple machines. Hadoop HDFS is one such implementation, where the NameNode acts as the central registry of all DataNodes, and DataNodes store data locally on disk. NoSQL databases (http://nosql-database.org) such as MongoDB, together with Hadoop, represent the leaders in the recent movement towards persistence in non-relational format.
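
From a client’s perspective the distribution is transparent – you address the NameNode and HDFS decides which DataNodes hold the blocks. A minimal sketch using the standard Hadoop FileSystem API (the NameNode hostname and the path are placeholders):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: point this at your NameNode.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write: blocks are replicated across DataNodes behind the scenes.
        try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
            out.write("hello distributed persistence\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client fetches blocks directly from the DataNodes holding them.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/demo/hello.txt"))))) {
            System.out.println(in.readLine());
        }
    }
}
```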

3. Scheduling (aka “Batch Processing”, “Workload Automation”)

A Scheduler determines when to execute jobs. Most modern schedulers support Job Chaining: parent-child job hierarchies and conditional execution, with execution tracking. The word “Scheduling” in the context of Quantitative Finance carries quite a different meaning than in general IT (Infrastructure), where it refers to “Workload Automation” or “Batch Processing” – for example, daily cleanup of log folders, shrinking databases, cleaning import/export folders, purging stale data, scanning permissions on sensitive folders, etc.; Infrastructure/Platform related tasks in general. By contrast, a typical EOD (End of Day) batch on a Hedge Fund/Investment Bank derivatives trading desk includes the steps below (a chaining sketch follows the list):

STEP 1. Mark to Market

STEP 2. Pnl Calculation

STEP 3. EOD Pricing

STEP 4. Stressing and Scenario Analysis

STEP 5. Downstream Feed
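
To make “Job Chaining” concrete, here is an illustrative, vendor-neutral sketch of the EOD chain above: each step runs only if its parent succeeded, and a status is reported per step. The step names mirror the list; everything else is made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

public class EodChain {
    public static void main(String[] args) {
        // Parent-child chain: insertion order is execution order.
        Map<String, BooleanSupplier> chain = new LinkedHashMap<>();
        chain.put("Mark to Market", () -> runStep("Mark to Market"));
        chain.put("Pnl Calculation", () -> runStep("Pnl Calculation"));
        chain.put("EOD Pricing", () -> runStep("EOD Pricing"));
        chain.put("Stressing and Scenario Analysis", () -> runStep("Stressing and Scenario Analysis"));
        chain.put("Downstream Feed", () -> runStep("Downstream Feed"));

        for (Map.Entry<String, BooleanSupplier> step : chain.entrySet()) {
            boolean ok = step.getValue().getAsBoolean();
            System.out.println(step.getKey() + (ok ? " OK" : " FAILED"));
            if (!ok) break; // conditional execution: children don't run if the parent failed
        }
    }

    // Stand-in for launching the real job (a process, stored proc, script...).
    private static boolean runStep(String name) {
        return true; // pretend it succeeded
    }
}
```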

Quartz is the most prominent Open Source scheduling library at the time of writing, supporting both job scheduling and load balancing (aka “Cluster”), with specific limitations (for example Quartz.NET – jobs must be .NET and implement the IJob interface). Further, all the Open Source libraries lack a GUI and persistence of anything (schedules, execution status/timestamps/history, parameters, output results, etc.). Standalone solutions: BMC Control-M (commercial, approx. >USD 20k base), Applied Algo ETL Suite (commercial, ~USD 1000 base, depending on how many nodes are in the server farm), Schedulix (Open Source + consulting fees).
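
In Quartz’s Java incarnation the same constraint applies: a job is a class implementing the org.quartz.Job interface, wired to a trigger. A minimal sketch scheduling a hypothetical Mark-to-Market job every weekday at 18:00:

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

// Jobs must implement org.quartz.Job (the Java analogue of Quartz.NET's IJob).
public class MarkToMarketJob implements Job {
    @Override
    public void execute(JobExecutionContext context) {
        System.out.println("Running Mark to Market..."); // real work goes here
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(MarkToMarketJob.class)
                .withIdentity("markToMarket", "eodBatch")
                .build();

        // Fire at 18:00, Monday through Friday (Quartz cron syntax).
        Trigger trigger = TriggerBuilder.newTrigger()
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 18 ? * MON-FRI"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```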

4. GUI

A Grid is NOT a GUI, and a GUI isn’t a mandatory component of a Grid. One observation, though, is the absence of GUIs from Cloud based solutions. For Hadoop, higher-level tooling and web based user interfaces come under separate projects (Pig and Oozie, for instance) which you’d need to download, install and configure separately, and which are geared towards Hadoop specialists.

What is Cloud?

Cloud doesn’t mean “Grid” – it simply means a Hosted Platform. But a Grid can be hosted in “The Cloud”. Google Compute Engine, for example, is a Grid hosted in The Cloud: https://developers.google.com/compute/docs/load-balancing/

The financial sector has been slow to adopt due to Regulatory and Compliance constraints, but we’re seeing incremental additions. AWS Nasdaq FinQloud: http://aws.amazon.com/solutions/case-studies/nasdaq-finqloud/

Before anyone embarks on a Cloud implementation, ask: is your data accessible from The Cloud? If not, you can forget it.

What’s Out There?

Here’s a list of Open Source and Commercial “Grids” and “Schedulers” – some run off The Cloud, others are meant for local installation. Hadoop is without doubt the leader in Big Data (despite some bad press on usability, and thus overall implementation cost: http://www.forbes.com/sites/danwoods/2012/07/27/how-to-avoid-a-hadoop-hangover), with legions of vendors building on top of it (Hortonworks, Cloudera, Microsoft HDInsight, etc.). You also have BMC Control-M (http://bmc.com) for traditional Enterprise workflow, and Applied Algo ETL Suite (https://appliedalgo.com), designed and priced for smaller application teams – in particular, Analytic/Quantitative Analysis with automatic persistence of numerical output in tabular format (Relational Database).

https://appliedalgo.com/appliedalgoweb/Doc/CompetitiveAnalysis/AppliedAlgo.Scheduler_LoadBalancer.FeatureComparison.htm

Is Cloud preferred over The Grid?

Simply put,

  • Cloud = a Grid that resides on the Internet, and
  • Grid = a Grid that resides on the Intranet.

Despite Cloud advocates pushing the idea that Cloud is the next big thing and everything should run on the Cloud (http://gavinbadcock.wordpress.com/2012/11/22/cloud-vs-grid-computing-why-are-we-leaving-the-grid-behind), it isn’t everything. Still, is it most of it?

Well, in terms of data size in bytes, perhaps – after all, with Social Media and unstructured public information/raw data, the Internet is infinitely vast. However, financial/accounting/statistical data is still overwhelmingly stored in relational format/databases, and for security/compliance reasons and technical practicality, it’s going to remain that way. The Private Grid isn’t going away.

Now, in terms of speed, anyone who claims calculation runs faster in The Cloud is just clueless. I am not even going to attempt to argue for or against it!

This brings us to the core question: would you [Rent] your apartment, or [Buy] it? The right answer depends on your purposes and intent. Despite all the hype, Cloud isn’t for everything – renting isn’t always cheaper than buying.

Simple math:

  • Google Compute Engine (Cloud): USD 1000 per month base + implementation cost for scheduling/integration/GUI
  • BMC Control-M: USD 20k base + 1000 schedules @ USD 200 each (i.e. USD 200k)
  • Schedulix: free software + USD 10k for initial setup/consulting
  • Applied Algo ETL Suite: USD 1250 + hardware + initial implementation

Over a one-year tenure, while BMC Control-M may be a bit expensive, Google Compute Engine isn’t really a lot more cost effective. Give it a couple of years and even BMC may come out cheaper than Google Compute Engine. One catch: Private Grids generally require dedicated staff to maintain – this adds to operating cost, so Google Compute Engine may still end up cheaper than BMC Control-M. However, you still can’t beat Schedulix and Applied Algo. Further, if your data is not accessible from IP addresses outside the corporate firewall, a Cloud based solution is simply not viable. General rule of thumb: in terms of Security and Performance, your data should be close to your Grid.

  • Google Compute Engine (Cloud) = USD 12k + development cost (scheduling/GUI/persistence, etc. – say 365 man-days to get it built) = USD 150k
  • BMC Control-M = USD 250k
  • Schedulix = USD 20k (free software; hardware + consulting)
  • Applied Algo ETL Suite = USD 2k + hardware = USD 10k

Notice, none of the above is “Big Money” – to put it in perspective, USD 200k is equivalent to just two mid-range developers for one year. The question is, would you pay USD 200k for a grid?

Fundamentally, when you hear advocates claiming Cloud is cheaper than a Private Grid – cross-examine the math. Pay attention to Overall Life-Cycle Cost, not just the up-front/monthly recurring cost.
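
As a back-of-the-envelope illustration (rough figures taken from the lists above, plus an assumed USD 8k/month for dedicated maintenance staff on the private option – your numbers will differ), here is how cumulative cost can be compared as the horizon lengthens:

```java
public class LifeCycleCost {
    // Cumulative cost over a horizon = up-front cost + monthly recurring * months.
    static double total(double upfront, double monthly, int months) {
        return upfront + monthly * months;
    }

    public static void main(String[] args) {
        for (int months : new int[] {12, 24, 36}) {
            // Cloud: ~USD 138k one-off development + USD 1k/month hosting (figures above).
            double cloud = total(138_000, 1_000, months);
            // Private grid: ~USD 250k up-front + assumed USD 8k/month for dedicated staff.
            double privateGrid = total(250_000, 8_000, months);
            System.out.printf("%d months: cloud=USD %.0f, private=USD %.0f%n",
                    months, cloud, privateGrid);
        }
    }
}
```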

More inspiration: http://java.dzone.com/articles/compute-grids-vs-data-grids
