Showing posts with label table. Show all posts
Showing posts with label table. Show all posts

Friday, 4 May 2012

Excel VBA Data Warehouse Generator Tool

Recently I was observing our database development team, and I observed that much of my colleagues were struggling with lots of boiler plate code. In the process of generating Dimension and Fact tables (in snow flaked schema), I thought it would be nice to create an easy to use tool to generate database schema with the following rules in place:

1. Each dimension table would have a primary key, combined with identity auto generation.
2. A dimension table may have referential constraint upon another dimension table
3. A fact table will have multiple referential constraints from multiple dimension tables but not from any other factual table.
4. A fact table will not have a primary key nor an identity column.

This tool is primarily an assisting tool for repetitive process occurred in the data warehouse table structure generation, though for any other design requirements developer intervention is very much required to meet the special conditions.



To use the tool, there is a main page where all the relevant information needs to be set like connection string, script folder path and updated database on which the operation needs to be performed once the list is refreshed.

Control Page:


Dimension Page:


Fact Page:


Then there is a SQL Server Management Studio style dimension and fact table generation worksheets where user can set the desired columns required according to the business needs along with integrated drop down list on the Table Link column where the refrence dimension tables can be selected(The table name for dimension and fact wil be appended by keywords "Dim" and "Fact" respectivly. And also for the dimension tables the key columns as they will be auto generated via VBA and appended terms like "Id").

Download Solution
Download solution

Sunday, 16 October 2011

MySQL Large Table: Split OR Partitioning???

For couple of past months this question has puzzled me at my work. And seeking answers over the internet, I wasn’t able to find a specific answer to the question. So in this post I would only like to highlight my experiences so far with my decision, with no bias to any particular method.

Table Split Vs Partitioning, this decision should be primarily based on the context of usage pattern of the database and type of queries being executed on the database on regular basis/users of the database.


When to split table into smaller tables:
• If the queried table is being scanned on non regular columns (i.e. the queries “Where” clause always changes to different columns within the table)
• If the queries are analytical in nature and direct users of the database are business users.
• If the partition mechanism has to span more than 1024 partition (MySQL limitation)

The disadvantage of splitting the table into multiple tables, highlight problems relevant to querying the database upon multiple tables (with usage of dynamic SQL within stored procedures), complex logic, creation of large number of tables and further more. But these problems outweigh the benefits achieved for analytical purposes once the system is set, keeping it in simple terms with each query upon spliced tables has fewer rows to scan physically and hence forth the results are faster with union all’ed result presented and consistent across any column scan involved in the query.

When to partition a table:
• If the queries are mostly regular in nature or database acts as a backend to the business system (i.e. the majority of queries “Where” clause is using the same column for scan within the table).
• The use of database if limited to storing of records and retrieval of records on standard parameters (i.e. non analytical purposes).
• Where database is being utilized by ORM mechanisms like ADO.NET/Hibernate.
• Foreign keys are not supported on partitioned table.

The disadvantage of partitioned table within an analytical environment is some times more detrimental in terms of performance than the advantages it results into. This is due to the fact when the column scans is performed on the partitioned table upon which the table is not partitioned is employs mysql more effort to scan the each partition of the table for the results and query execution is slower than the table split. But also to mention in spite of the partitioning mechanism used one should also take care of the mechanism of “Partition Pruning” related to the where clause in the select queries illustrating the mysql which partitions to scan for the result.

Performance Results:

In the experiment table contains 28,44,042 rows with "from_date" being indexed:
Note: all the tables in the example are partitioned on the “from_date” column in the table. 


#**Simple Table **
CREATE TABLE `salaries` (
 `emp_no` INT(11) NOT NULL,
 `salary` INT(11) NOT NULL,
 `from_date` DATE NOT NULL,
 `to_date` DATE NOT NULL,
 PRIMARY KEY (`emp_no`, `from_date`),
 INDEX `emp_no` (`emp_no`)
 )
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB
ROW_FORMAT=DEFAULT

#**Partition by month**
CREATE TABLE `salaries_copy` (
  `emp_no` int(11) NOT NULL,
  `salary` int(11) NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  PRIMARY KEY (`emp_no`,`from_date`),
  KEY `emp_no` (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPACT
/*!50100 PARTITION BY HASH (Month(from_date))
PARTITIONS 12 */

#**Partition by Range**
CREATE TABLE `salaries_copy_1` (
  `emp_no` int(11) NOT NULL,
  `salary` int(11) NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  PRIMARY KEY (`emp_no`,`from_date`),
  KEY `emp_no` (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPACT
/*!50100 PARTITION BY RANGE (to_days(from_date))
(PARTITION p0 VALUES LESS THAN (to_days('1985-01-01')) ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN (to_days(‘1986-01-01’)) ENGINE = InnoDB,
 PARTITION p2 VALUES LESS THAN (to_days(‘1987-01-01’)) ENGINE = InnoDB,
 PARTITION p3 VALUES LESS THAN (to_days('1988-01-01')) ENGINE = InnoDB,
 PARTITION p4 VALUES LESS THAN (to_days('1989-01-01')) ENGINE = InnoDB,
 PARTITION p5 VALUES LESS THAN (to_days('1990-01-01')) ENGINE = InnoDB,
 PARTITION p6 VALUES LESS THAN (to_days('1991-01-01')) ENGINE = InnoDB,
 PARTITION p7 VALUES LESS THAN (to_days('1992-01-01')) ENGINE = InnoDB,
 PARTITION p8 VALUES LESS THAN (to_days('1993-01-01')) ENGINE = InnoDB,
 PARTITION p9 VALUES LESS THAN (to_days('1994-01-01')) ENGINE = InnoDB,
 PARTITION p10 VALUES LESS THAN (to_days('1995-01-01')) ENGINE = InnoDB,
 PARTITION p11 VALUES LESS THAN (to_days('1996-01-01')) ENGINE = InnoDB,
 PARTITION p12 VALUES LESS THAN (to_days('1997-01-01')) ENGINE = InnoDB,
 PARTITION p13 VALUES LESS THAN (to_days('1998-01-01')) ENGINE = InnoDB,
 PARTITION p14 VALUES LESS THAN (to_days('1999-01-01')) ENGINE = InnoDB,
 PARTITION p15 VALUES LESS THAN (to_days('2000-01-01')) ENGINE = InnoDB,
 PARTITION p16 VALUES LESS THAN (to_days('2001-01-01')) ENGINE = InnoDB,
 PARTITION p17 VALUES LESS THAN (to_days('2002-01-01')) ENGINE = InnoDB,
 PARTITION p18 VALUES LESS THAN (to_days('2003-01-01')) ENGINE = InnoDB,
 PARTITION p19 VALUES LESS THAN (to_days('2004-01-01')) ENGINE = InnoDB,
 PARTITION p20 VALUES LESS THAN (to_days('2005-01-01')) ENGINE = InnoDB,
 PARTITION pmax VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */

#********* Test 1: Queries scanning the partitioned column: from_date ***********
Select SQL_NO_CACHE * From salaries tbl
where tbl.from_date >= '2000-03-15' and tbl.from_date < '2000-09-25';
#Duration for 1 query: 0.016 sec. (+ 2.137 sec. network)

Select SQL_NO_CACHE * From salaries_copy tbl
where tbl.from_date >= '2000-03-15' and tbl.from_date < '2000-09-25';
#Duration for 1 query: 2.106 sec. (+ 5.288 sec. network)

Select SQL_NO_CACHE * From salaries_copy_1 tbl
where tbl.from_date >= '2000-03-15' and tbl.from_date < '2000-09-25';
#Duration for 1 query: 0.063 sec. (+ 1.185 sec. network)

Select SQL_NO_CACHE * From salaries_1985
where salaries_1985.from_date >= '2000-03-15' and salaries_1985.from_date < '2000-09-25'
 UNION ALL 
Select * From salaries_1986
where salaries_1986.from_date >= '2000-03-15' and salaries_1986.from_date < '2000-09-25'
 UNION ALL …
…
Select * From salaries_2005
where salaries_2005.from_date >= '2000-03-15' and salaries_2005.from_date < '2000-09-25';
#Duration for 1 queries: 1.638 sec. (+ 0.484 sec. network)

#********* Test 2: Queries scanning the non partitioned column: to_date ***********
Select SQL_NO_CACHE * From salaries tbl
where tbl.to_date >= '2000-03-15' and tbl.to_date < '2000-09-25';
#Duration for 1 query: 0.109 sec. (+ 2.762 sec. network)

Select SQL_NO_CACHE * From salaries_copy tbl
where tbl.to_date >= '2000-03-15' and tbl.to_date < '2000-09-25';
#Duration for 1 query: 1.201 sec. (+ 6.521 sec. network)

Select SQL_NO_CACHE * From salaries_copy_1 tbl
where tbl.to_date >= '2000-03-15' and tbl.to_date < '2000-09-25';
#Duration for 1 query: 7.472 sec. (+ 3.058 sec. network)

Select SQL_NO_CACHE * From salaries_1985
where salaries_1985.to_date >= '2000-03-15' and salaries_1985.to_date < '2000-09-25'
 UNION ALL 
Select * From salaries_1986
where salaries_1986.to_date >= '2000-03-15' and salaries_1986.to_date < '2000-09-25'
 UNION ALL …
…
Select * From salaries_2005
where salaries_2005.to_date >= '2000-03-15' and salaries_2005.to_date < '2000-09-25';
#Duration for 1 query: 1.670 sec. (+ 0.483 sec. network)

Comparison Table:

Query on “from_date Query on “to_date Description
Indexed column Non-Indexed column -
0.016 sec. (+ 2.137 sec. network) 0.109 sec. (+ 2.762 sec. network) Simple table
2.106 sec. (+ 5.288 sec. network) 1.201 sec. (+ 6.521 sec. network) Partition by HASH (Month(from_date))
0.063 sec. (+ 1.185 sec. network) 7.472 sec. (+ 3.058 sec. network) Partition by RANGE (to_days(from_date))
1.638 sec. (+ 0.484 sec. network) 1.670 sec. (+ 0.483 sec. network) Table Split by year(from_date)


Though I do not wish to end this trail in here and I would like to know reader opinions and thoughts about this topic and shed more light whether the table split is better than partitioning or vice versa.

Look forward to your comments …

References: 
Link1: http://dev.mysql.com/doc/refman/5.1/en/partitioning.html