ETL

Reverse Engineering Data-flow in a Data Platform with Thousands of tables?

Ever been tasked to inherit, or migrate an existing (legacy) Data Platform? There are numerous Open Source (Hadoop/Sqoop, Schedulix …) and Commercial tools (BMC Control-M, Appliedalgo.com, stonebranch …etc) which can help you operate the Data Platform – typically gives you multitude of platform services:
• Job Scheduling
• ETL
• Load Balancing & Grid Computing
• Data Dictionary / Catalogue
• Execution tracking (track/persist job parameters & output)

Typical large scale application has hundreds to thousands of input data files, queries and intermediate/output data tables.
DataPlatform_DataflowMapping

Mentioned Open Source and Commercial packages facilitates operation of Data Platform. Tools which helps generates ERD diagrams typically relies on PK-FK relationships being defined – but of course more often than not this is not the case. Example? Here’s how you can Drag-drop tables in a Microsoft SQL Server onto a Canvas to create ERD – https://www.youtube.com/watch?v=BNx1TYItQn4
DataPlatform_DataflowMapping_Who

If you’re tasked to inherit or migrate such Data Platform, first order of business is to manually map out data flow. Why? To put in a fix, or enhancement, you’d first need to understand data flow before any work can commence.

And, that’s a very expensive, time consuming proposition.

There’re different ways to tackle the problem. Here’s one (Not-so-Smart) option:
• Manually review database queries and stored procedures
• Manually review application source code and extract from it embedded SQL statements

Adding to complexity,
• Dynamic SQL
• Object Relational Mapper (ORM)

The more practical approach would be to employ a SQL Profiler. Capture SQL Statements executed, and trace the flow manually. Even then, this typically requires experienced developers to get the job done (Which isn’t helping when you want to keep the cost down & delivery lead time as short as possible). As such undertaking is inherently risky – as you can’t really estimate how long it’ll take to map out the flow until you do.

There’s one command line utility MsSqlDataflowMapper (Free) from appliedalgo.com which can help. Basically, MsSqlDataflowMapper takes SQL Profiler trace file as input (xml), analyze captured SQL Statements. Look for INSERT’s and UPDATE’s. Then automatically dump data flow to a flow chart (HTML 5). Behind the scene, it uses SimpleFlowDiagramLib from Gridwizard to plot the flow chart – https://gridwizard.wordpress.com/2015/03/31/simpleflowdiagramlib-simple-c-library-to-serialize-graph-to-xml-and-vice-versa/

Limitation?
• Microsoft SQL Server only (To get around this, you can build your own tool capture SQL statements against Oracle/Sybase/MySQL…etc, analyze it, look up INSERT’s and UPDATE’s, then route result to SimpleFlowDiagramLib to plot the flow chart)
MsSqlDataflowMapper operates on table-level. It identify source/destination tables in process of mapping out the flow. However, it doesn’t provide field-level source information (a particular field in output table comes from which source tables?)
• The tool does NOT automatically *group* related tables into different Regions in diagram (This requires a lot more Intelligence in construction of the tool – as we all know, parsing SQL is actually a very complex task! https://gridwizard.wordpress.com/2014/11/08/looking-for-a-sql-parser-for-c-dotnet). At the end of the day, it still takes skilled developer to Make Sense of the flow.

Happy Coding!

Advertisements

Java and dotnet Interop

This article is about Java-dotnet Interop. We’ll explore what options we have for different scenario where interop is required.

First, when we say “Java-dotnet Interop”, there are two possibilities:

1. Java -to- dotnet communications

2. dotnet -to-Java communications

Secondly, we assume, if you’re developing in Java, you’d run it on Linux (Or simply put, if your application written in Java, why would it run on Windows?)

Given above, what are our options?

 

1. Socket

Anand Manikiam has written a piece on this subject, http://www.codeproject.com/Articles/11602/Java-and-Net-interop-using-Sockets

The pros for this approach are:

a. No middle-ware

b. Fast

The cons are:

a. Resiliency

b. Casting complex object/classes from byte[]?

c. Message security? Encryption? Anti-tampering? DOS? If not implemented this be Intranet application only.

 

2. Web Services

I’ve written an article of consuming Java-ws from dotnet:

https://gridwizard.wordpress.com/2014/12/26/java-ws-and-dotnet-interop-example/

You will also find plenty of discussions on consuming WCF-from-Java:

http://www.codeproject.com/Articles/777036/Consuming-WCF-Service-in-Java-Client

The pros for this approach are:

a. No middle-ware

b. Higher level of compatibility with code coded in more languages (C++/SOAP, Python, R …etc)

The cons are:

a. Less fast than socket

b. Resiliency

c. Message security? Encryption? Anti-tampering? DOS? If not implemented this be Intranet application only.

d. Slower than Socket! (Web Services overhead)

 

3. Message Bus

RabbitMQ (http://www.rabbitmq.com) is all about Messaging. If you’re developing real time applications, RabbitMQ offers high performance battle tested communication platforms and it as an API for just about any language on the planet. C++, dotnet, Java, Perl, Python…

Pros are:

a. Resiliency – producers and consumers can die and crash at any moment.

b. Performance

cons:

a. You need install Middleware, and if you’re a software vendor, you’d need bundle installation of RabbitMQ with your application

 

4. Commercial Tools

Depending on what you’re building, if what you’re trying to build is a computing grid, then there are commercial tools which allows you to run jobs on basically any platform, coded in any language.

Appliedalgo.com for instances supports:

a. Scheduling, conditional job chaining and Workload Automation

b. Grid Computing – nodes/slaves on any platform/language

c. Automatic persistence of run history, parameters, input and results

(Even configure cell level validations by “IsNumber”, or use of user specified Regular Expression)

d. GUI for you to track run parameters, input and results

However, such tools inevitably introduces execution overhead. So depending on whether you’re …

a. Executing high number of light weight jobs –> Probably should not use any tool besides a Message bus such as RabbitMQ

b. Executing medium number of medium weight jobs –> Best application of Workload Automation Data Platforms such as Appliedalgo.com

c. Executing low number of heavy weight jobs –> Best custom coded, persistence via BCP (There’s no other way for million rows or #bigdata processing)

 
But this would not be a viable option for instance if you’re building a hotel booking system with web tier built in ASP.NET and backend in Java with Java-ws

Happy Coding!

 

 

TSQL – Utility to read csv and Excel files

/*
If you see this error?
The 32-bit OLE DB provider “Microsoft.ACE.OLEDB.12.0″ cannot be loaded in-process on a 64-bit SQL Server.

To resolve this error you need to get the “Access Database Engine 2010″ to facilitate the transfer of data between existing Microsoft Office files such as “Microsoft Office Access 2010″ (*.mdb and *.accdb) files and “Microsoft Office Excel 2010″ (*.xls, *.xlsx, and *.xlsb) files to other data sources such as “Microsoft SQL Server”.
Get the Microsoft Access Database Engine 2010 Redistributable [http://www.microsoft.com/en-us/download/details.aspx?id=13255]

http://sqlwithmanoj.com/tag/microsoft-ace-oledb-12-0/
*/
IF EXISTS (SELECT * FROM sys.objects WHERE type = ‘P’ AND name = ‘sp_ReadCSVFile’)
DROP PROCEDURE sp_ReadCSVFile
GO

IF EXISTS (SELECT * FROM sys.objects WHERE type = ‘P’ AND name = ‘sp_ReadExcelFile’)
DROP PROCEDURE sp_ReadExcelFile
GO

IF EXISTS (
SELECT * FROM sysobjects WHERE id = object_id(N’fn_SplitString’)
AND xtype IN (N’FN’, N’IF’, N’TF’)
)
DROP FUNCTION fn_SplitString
GO

/*

Remember to put a file Schema.ini in same folder as csv file. Also ‘@TargetTable’ parameter should be csv file name (but without Extension)

exec sp_ReadCSVFile ‘C:\tmp’, ‘EmpData’

[EmpData.csv]
FORMAT=Delimited(;)
ColNameHeader=True
MaxScanRows=0
CharacterSet=ANSI
TextDelimiter=`
DecimalSymbol=.
COL1=EmpId TEXT
COL2=FirstName TEXT
COL3=LastName TEXT
COL4=MiddleName TEXT
COL5=EmailAddr TEXT
*/
CREATE PROCEDURE sp_ReadCSVFile
(
@TargetFolder varchar(255),
@TargetTable varchar(255)
)
AS
BEGIN
declare @SqlStmt nvarchar(max)

set @SqlStmt = ‘IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N”[dbo].[‘ + @TargetTable + ‘]”) AND type in (N”U”))’
set @SqlStmt = @SqlStmt + ‘ TRUNCATE TABLE ‘ + @TargetTable
PRINT @SqlStmt
exec sp_executesql @SqlStmt

set @SqlStmt = ‘ SELECT * INTO ‘ + @TargetTable
+ ‘ FROM OpenDataSource (”Microsoft.ACE.OLEDB.12.0”, ”Data Source=”‘ + @TargetFolder + ‘”;Extended properties=Text”)…[‘
+ @TargetTable + ‘#csv]’
PRINT @SqlStmt
exec sp_executesql @SqlStmt
END
GO

/*
Usage:
exec sp_ReadExcelFile ‘C:\tmp’, ‘EmpData’, ‘EmpData.xls’, ‘[Sheet1$]’, ‘Col1,Col2,Col3,Col4,Col5’
*/
CREATE PROCEDURE sp_ReadExcelFile
(
@TargetFolder varchar(255), /* For example, ‘C:\tmp’ */
@TargetTable varchar(255), /* For example, ‘EmpData’ */
@ExcelFile varchar(255), /* For example, ‘EmpData.xls’ */
@ExcelSheet varchar(255), /* For example, ‘[Sheet1$]’ */
@ExcelFields varchar(8000) /* Comma separate list, for example: ‘Col1,Col2,Col3,Col4,Col5’ */
)
AS
BEGIN
declare @SqlStmt nvarchar(max)
declare @FirstColumn nvarchar(255)

set @SqlStmt = ‘IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N”[dbo].[‘ + @TargetTable + ‘]”) AND type in (N”U”))’
set @SqlStmt = @SqlStmt + ‘ DROP TABLE ‘ + @TargetTable
PRINT @SqlStmt
exec sp_executesql @SqlStmt

select top 1 @FirstColumn=[part] from fn_SplitString(@ExcelFields, ‘,’)
print ‘FirstColumn: ‘ + @FirstColumn

set @SqlStmt = ‘SELECT ‘ + @ExcelFields + ‘ INTO ‘ + @TargetTable + ‘ FROM OPENROWSET(”Microsoft.ACE.OLEDB.12.0”,
”Excel 12.0;DATABASE=’ + @TargetFolder + ‘\’ + @ExcelFile + ‘;IMEX=1”, ”Select * from ‘ + @ExcelSheet + ”’)’
+ ‘WHERE NOT ‘ + @FirstColumn + ‘ IS NULL’
PRINT @SqlStmt
exec sp_executesql @SqlStmt
END
GO

/*
Taken from: http://stackoverflow.com/questions/697519/split-function-equivalent-in-t-sql
Usage: select top 1 part from fn_SplitString(‘aaa,bbb,ccc,ddd,eee’,’,’)
*/
CREATE FUNCTION [dbo].[fn_SplitString]
(
@sString nvarchar(2048),
@cDelimiter nchar(1)
)
RETURNS @tParts TABLE ( part nvarchar(2048) )
AS
BEGIN
if @sString is null return
declare @iStart int,
@iPos int
if substring( @sString, 1, 1 ) = @cDelimiter
begin
set @iStart = 2
insert into @tParts
values( null )
end
else
set @iStart = 1
while 1=1
begin
set @iPos = charindex( @cDelimiter, @sString, @iStart )
if @iPos = 0
set @iPos = len( @sString )+1
if @iPos – @iStart > 0
insert into @tParts
values ( substring( @sString, @iStart, @iPos-@iStart ))
else
insert into @tParts
values( null )
set @iStart = @iPos+1
if @iStart > len( @sString )
break
end
RETURN
END
GO

DataTable to-from HTML Table

Here’s a small handy utility which converts a C# DataTable to/from HTML Table. Happy Coding!

using System;
using System.Collections.Generic;
using System.Data;
using System.Globalization;
using System.Text;
using System.Xml;

namespace ClientUtil
{
public class DataTableUtil
{

public static string DataTableToXmlString(DataTable dtData)
{
if (dtData == null || dtData.Columns.Count == 0)
return (string) null;
DataColumn[] primaryKey = dtData.PrimaryKey;
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.Append(“<TABLE>”);
stringBuilder.Append(“<TR>”);
foreach (DataColumn dataColumn in (InternalDataCollectionBase) dtData.Columns)
{
if (DataTableUtil.IsPrimaryKey(dataColumn.ColumnName, primaryKey))
stringBuilder.Append(“<TH IsPK=’true’ ColType='”).Append(Convert.ToString(dataColumn.DataType == typeof (object) ? (object) typeof (string) : (object) dataColumn.DataType)).Append(“‘>”).Append(dataColumn.ColumnName.Replace(“&”, “”)).Append(“</TH>”);
else
stringBuilder.Append(“<TH IsPK=’false’ ColType='”).Append(Convert.ToString(dataColumn.DataType == typeof (object) ? (object) typeof (string) : (object) dataColumn.DataType)).Append(“‘>”).Append(dataColumn.ColumnName.Replace(“&”, “”)).Append(“</TH>”);
}
stringBuilder.Append(“</TR>”);
int num1 = 0;
foreach (DataRow dataRow in (InternalDataCollectionBase) dtData.Rows)
{
stringBuilder.Append(“<TR>”);
int num2 = 0;
foreach (DataColumn dataColumn in (InternalDataCollectionBase) dtData.Columns)
{
string str = Convert.IsDBNull(dataRow[dataColumn.ColumnName]) ? (string) null : Convert.ToString(dataRow[dataColumn.ColumnName]).Replace(“<“, “&lt;”).Replace(“>”, “&gt;”).Replace(“\””, “&quot;”).Replace(“‘”, “&apos;”).Replace(“&”, “&amp;”);
if (!string.IsNullOrEmpty(str))
stringBuilder.Append(“<TD>”).Append(str).Append(“</TD>”);
else
stringBuilder.Append(“<TD>”).Append(“</TD>”);
++num2;
}
stringBuilder.Append(“</TR>”);
++num1;
}
stringBuilder.Append(“</TABLE>”);
return ((object) stringBuilder).ToString();
}

protected static bool IsPrimaryKey(string ColumnName, DataColumn[] PKs)
{
if (PKs == null || string.IsNullOrEmpty(ColumnName))
return false;
foreach (DataColumn dataColumn in PKs)
{
if (dataColumn.ColumnName.ToLower().Trim() == ColumnName.ToLower().Trim())
return true;
}
return false;
}

public static DataTable XmlStringToDataTable(string XmlData)
{
DataTable dataTable = (DataTable) null;
IList<DataColumn> list = (IList<DataColumn>) new List<DataColumn>();
if (string.IsNullOrEmpty(XmlData))
return (DataTable) null;
XmlDocument xmlDocument1 = new XmlDocument();
xmlDocument1.PreserveWhitespace = true;
XmlDocument xmlDocument2 = xmlDocument1;
xmlDocument2.LoadXml(XmlData);
XmlNode xmlNode1 = xmlDocument2.SelectSingleNode(“/TABLE”);
if (xmlNode1 != null)
{
dataTable = new DataTable();
int num = 0;
foreach (XmlNode xmlNode2 in xmlNode1.SelectNodes(“TR”))
{
if (num == 0)
{
foreach (XmlNode xmlNode3 in xmlNode2.SelectNodes(“TH”))
{
bool result = false;
string str = xmlNode3.Attributes[“IsPK”].Value;
if (!string.IsNullOrEmpty(str))
{
if (!bool.TryParse(str, out result))
result = false;
}
else
result = false;
Type type = Type.GetType(xmlNode3.Attributes[“ColType”].Value);
DataColumn column = new DataColumn(xmlNode3.InnerText, type);
if (result)
list.Add(column);
if (!dataTable.Columns.Contains(column.ColumnName))
dataTable.Columns.Add(column);
}
if (list.Count > 0)
{
DataColumn[] dataColumnArray = new DataColumn[list.Count];
for (int index = 0; index < list.Count; ++index)
dataColumnArray[index] = list[index];
dataTable.PrimaryKey = dataColumnArray;
}
}
else
{
DataRow row = dataTable.NewRow();
int index = 0;
foreach (XmlNode xmlNode3 in xmlNode2.SelectNodes(“TD”))
{
Type dataType = dataTable.Columns[index].DataType;
string s = xmlNode3.InnerText;
if (!string.IsNullOrEmpty(s))
{
try
{
s = s.Replace(“&lt;”, “<“);
s = s.Replace(“&gt;”, “>”);
s = s.Replace(“&quot;”, “\””);
s = s.Replace(“&apos;”, “‘”);
s = s.Replace(“&amp;”, “&”);
row[index] = Convert.ChangeType((object) s, dataType);
}
catch
{
if (dataType == typeof (DateTime))
row[index] = (object) DateTime.ParseExact(s, “yyyyMMdd”, (IFormatProvider) CultureInfo.InvariantCulture);
}
}
else
row[index] = Convert.DBNull;
++index;
}
dataTable.Rows.Add(row);
}
++num;
}
}
return dataTable;
}
}
}

That’s it. You may also want to check out how to convert DataTable to JSON string?
http://www.codeproject.com/Articles/777284/Consuming-JSON-ASMX-Web-Services-with-AngularJS-A

Taken’s from Mohamed Kamal Pharm’s article – first you convert DataTable.Rows into a List, the use System.Web.Script.Serialization.JavaScriptSerializer to render the list to a string:

public String ConvertDataTableTojSonString(DataTable dataTable)
{
System.Web.Script.Serialization.JavaScriptSerializer serializer =
new System.Web.Script.Serialization.JavaScriptSerializer();

List<Dictionary<String, Object>> tableRows = new List<Dictionary<String, Object>>();

Dictionary<String, Object> row;

foreach (DataRow dr in dataTable.Rows)
{
row = new Dictionary<String, Object>();
foreach (DataColumn col in dataTable.Columns)
{
row.Add(col.ColumnName, dr[col]);
}
tableRows.Add(row);
}
return serializer.Serialize(tableRows);
}

You may also want to checkout how to serialize a graph to/from xml! https://gridwizard.wordpress.com/2015/03/31/simpleflowdiagramlib-simple-c-library-to-serialize-graph-to-xml-and-vice-versa