Building a Data Dictionary for SQL Server, MySQL, and Postgres
Table of Contents
TL;DR - Give me the code, now!
Introduction
As a data analyst in a large enterprise, we have to touch tens of application databases, and each database would have tens and hundreds of tables. Building a data dictionary is a crucial step to help discover what data is available.
At first sight, this might seem like a trivial problem involving queries to many information schema tables, but as we have identified several additional attributes which would be useful to determine which table/data source is useful, we found those information simply not available in information schemas (such as column descriptions in SQL Server).
And to take this to a further level, since our databases are centrally registered on one main SQL Server as linked server, we would like to fully automate and implement a stored procedure to crawl and collect all schema information into a single data dictionary table.
In this post, we will automate the building of data dictionary on a SQL Server, by collecting database schemas from local databases and databases from SQL Server, MySQL and PostgreSQL linked servers. Since the script is written in declarative style, you could easily expand and adapt to other linked DBMS solutions. You can also use the individual SQL per DBMS if you are not interested in building a central data dictionary on a SQL server.
This post is tested against SQL Server 2016
What we will be collecting
With the basics such as table name, column name and data type in mind, we have decided to collect the following information which would be beneficial to identify the purpose of the column and overall quality of the data source:
- Foreign Key relationships - This will be helpful to identify table references and the corresponding primary key column, especially when the foreign key column is not named predictably
- Table and Column description - In each DBMS, comments and description could be added to tables and columns, and those might already be populated by the application developer. For example, when I designed one of my department’s application, the columns’ description are written in SSDT and subsequently written to the database
- (Estimated) Table Row count - This will be helpful to identify (1) in case of replication, tables that were not replicated; and (2) tables belonging to features which were not used. Collecting actual row count would be a performance nightmare, so we will turn to alternative methods as we outline per each DBMS below
Note: The foreign key information is not collected for Postgres because I am not familiar enough with Postgres to write that part of the query, you are welcomed to contribute!
The code
DataDictionary table structure
First of all, we need to define the data dictionary’s schema, below is the schema we will be using. In additional to the fields and information named above, we have added [ColumnDescriptionDD]
, which is a special column for us to manually put in our remarks, and this field will be persisted across automatic schema pulling.
CREATE TABLE [dbo].[DataDictionary]
(
[SystemName] VARCHAR(255) NOT NULL ,
[ServerName] VARCHAR(255) NOT NULL,
[DatabaseName] VARCHAR(255) NOT NULL,
[SchemaName] VARCHAR(255) NOT NULL,
[TableName] VARCHAR(255) NOT NULL,
[ColumnName] VARCHAR(255) NOT NULL,
[OrdinalPosition] INT NULL,
[DataType] VARCHAR(255) NULL,
[ForeignKeyTo] VARCHAR(255) NULL,
[TableDescription] NVARCHAR(1024) NULL,
[ColumnDescription] NVARCHAR(1024) NULL,
[TableRowCount] BIGINT NULL,
[ColumnDescriptionDD] NVARCHAR(1024) NULL,
[FirstSeenAt] DATETIME2(0) NOT NULL DEFAULT CURRENT_TIMESTAMP,
[LastSeenAt] DATETIME2(0) NULL,
CONSTRAINT [PK_DataDictionary] PRIMARY KEY NONCLUSTERED ([ServerName], [DatabaseName], [SchemaName], [TableName], [ColumnName])
)
CREATE CLUSTERED INDEX [IX_DataDictionary_Clustered] ON [dbo].[DataDictionary] ([ServerName], [DatabaseName], [SchemaName], [TableName], [OrdinalPosition])
GO
EXEC sp_addextendedproperty @name = N'MS_Description',
@value = N'The name of the last linked server in a linked server chain (not necessarily the hostname of the server itself)',
@level0type = N'SCHEMA',
@level0name = N'dbo',
@level1type = N'TABLE',
@level1name = N'DataDictionary',
@level2type = N'COLUMN',
@level2name = N'ServerName'
GO
EXEC sp_addextendedproperty @name = N'MS_Description',
@value = N'Populated when the column disappears (i.e. when table, schema, or server is gone). Can be reverted to NULL when column reappears',
@level0type = N'SCHEMA',
@level0name = N'dbo',
@level1type = N'TABLE',
@level1name = N'DataDictionary',
@level2type = N'COLUMN',
@level2name = N'LastSeenAt'
GO
EXEC sp_addextendedproperty @name = N'MS_Description',
@value = N'NULL means the column has been deleted',
@level0type = N'SCHEMA',
@level0name = N'dbo',
@level1type = N'TABLE',
@level1name = N'DataDictionary',
@level2type = N'COLUMN',
@level2name = N'OrdinalPosition'
Collecting schema from SQL Server
The following query is for collecting the data dictionary fields from SQL Server, with $DatabaseName
being the placeholder for specifying the SQL Database name.
To filter out irrelevant tables, a WHERE clause has been added to exclude tables under a user schema (you should adjust the DOMAIN\
accordingly), tables tagged with TEST
or BAK
, and internal tables created by Master Data Service (MDS).
SELECT
schemas.name AS SchemaName,
tables.name AS TableName,
columns.name AS ColumnName,
ROW_NUMBER() OVER (PARTITION BY columns.object_id ORDER BY columns.column_id) AS OrdinalPosition,
types.name + CASE
WHEN types.name IN ('binary', 'char', 'varbinary', 'varchar') THEN '(' + CASE WHEN columns.max_length = -1 THEN 'MAX' ELSE CAST(columns.max_length AS VARCHAR(255)) END + ')'
WHEN types.name IN ('nchar', 'nvarchar') THEN '(' + CASE WHEN columns.max_length = -1 THEN 'MAX' ELSE CAST(columns.max_length / 2 AS VARCHAR(255)) END + ')'
WHEN types.name IN ('decimal', 'numeric') THEN '(' + CAST(columns.precision AS VARCHAR(255)) + ',' + CAST(columns.scale AS VARCHAR(255)) + ')'
WHEN types.name IN ('datetime2', 'datetimeoffset', 'time') THEN '(' + CAST(columns.scale AS VARCHAR(255)) + ')'
ELSE ''
END AS DataType,
fk_ref.ReferencedColumnQN AS ForeignKeyTo,
table_desc.Description AS TableDescription,
col_desc.Description AS ColumnDescription,
tstat.TableRowCount
FROM $DatabaseName.sys.columns columns
INNER JOIN (
SELECT name, object_id, schema_id, is_ms_shipped FROM $DatabaseName.sys.tables
UNION ALL
SELECT name, object_id, schema_id, is_ms_shipped FROM $DatabaseName.sys.views
) tables ON tables.object_id = columns.object_id
INNER JOIN $DatabaseName.sys.schemas schemas ON schemas.schema_id = tables.schema_id
INNER JOIN $DatabaseName.sys.types types ON types.user_type_id = columns.user_type_id
LEFT JOIN ( -- Get table row count (LEFT JOIN because VIEWs do not have a row count)
-- Ref: https://stackoverflow.com/a/28917736
SELECT object_id, SUM([rows]) AS TableRowCount
FROM $DatabaseName.sys.partitions
WHERE index_id in (0,1)
GROUP BY object_id
) tstat ON tstat.object_id = tables.object_id
LEFT JOIN ( -- Get foreign key references (If one column has multiple FK, one will be chosen to display arbitrarily, however this should be rare in any sane DB design)
SELECT
parent_object_id,
parent_column_id,
ReferencedColumnQN
FROM (
SELECT
fkc.parent_object_id,
fkc.parent_column_id,
QUOTENAME(s_child.name) + '.' + QUOTENAME(t_child.name) + '.' + QUOTENAME(c_child.name) AS ReferencedColumnQN, -- Qualified Name
ROW_NUMBER() OVER (PARTITION BY fkc.parent_object_id, fkc.parent_column_id ORDER BY fkc.constraint_object_id) AS Priority -- ORDER BY is arbitrary
FROM $DatabaseName.sys.foreign_keys fk
INNER JOIN $DatabaseName.sys.foreign_key_columns fkc ON fkc.constraint_object_id = fk.object_id
INNER JOIN $DatabaseName.sys.tables t_child ON t_child.object_id = fkc.referenced_object_id
INNER JOIN $DatabaseName.sys.columns c_child ON c_child.object_id = t_child.object_id AND fkc.referenced_column_id = c_child.column_id
INNER JOIN $DatabaseName.sys.schemas s_child ON s_child.schema_id = t_child.schema_id
) t
WHERE Priority = 1
) fk_ref ON fk_ref.parent_object_id = columns.object_id AND fk_ref.parent_column_id = columns.column_id
LEFT JOIN (
-- Ref: https://learn.microsoft.com/en-us/sql/relational-databases/system-catalog-views/extended-properties-catalog-views-sys-extended-properties?view=sql-server-ver16
SELECT
major_id AS object_id,
CAST(value AS NVARCHAR(1024)) AS Description
FROM $DatabaseName.sys.extended_properties
WHERE class = 1 AND minor_id = 0 AND name = 'MS_Description'
) table_desc ON table_desc.object_id = columns.object_id
LEFT JOIN (
-- See table_desc reference
SELECT
major_id AS object_id,
minor_id AS column_id,
CAST(value AS NVARCHAR(1024)) AS Description
FROM $DatabaseName.sys.extended_properties
WHERE class = 1 AND name = 'MS_Description'
) col_desc ON col_desc.object_id = columns.object_id AND col_desc.column_id = columns.column_id
WHERE
tables.is_ms_shipped = 0
AND NOT (schemas.name LIKE 'DOMAIN\%') -- Ignore personal tables
AND NOT (tables.name LIKE '%_BAK_%' OR tables.name LIKE '%_TEST_%') -- Intermediate tables
AND NOT (schemas.name = 'mdm' AND (tables.name LIKE 'viw_%' OR tables.name LIKE 'tbl_%')) -- MDS internal tables
AND NOT (schemas.name = 'stg' AND (tables.name LIKE 'viw_%' OR tables.name LIKE '%_Leaf')) -- MDS internal tables
AND NOT (schemas.name = 'stg' AND tables.name LIKE '[[]stg].[[]viw_%') -- MDS internal tables, name looks malformed but it exists
Collecting schema from MySQL
The following query is for collecting the data dictionary fields from MySQL. Since there is no catalog or database in MySQL, there is no $DatabaseName
placeholder in this SQL, and as a result the SQL for pulling MySQL schema is much simpler.
It should be noted that, the row count estimation could be a very rough figure, as documented in MySQL Reference Manual:
The number of rows. Some storage engines, such as MyISAM, store the exact count. For other storage engines, such as InnoDB, this value is an approximation, and may vary from the actual value by as much as 40% to 50%. In such cases, use SELECT COUNT(*) to obtain an accurate count.
SELECT
columns.TABLE_SCHEMA AS SchemaName,
columns.TABLE_NAME AS TableName,
columns.COLUMN_NAME AS ColumnName,
columns.ORDINAL_POSITION AS OrdinalPosition,
columns.COLUMN_TYPE AS DataType, -- Can list unassigned for number and enum type
fk_ref.ForeignKeyQN AS ForeignKeyTo,
LEFT(tables.TABLE_COMMENT, 1024) AS TableDescription,
LEFT(columns.COLUMN_COMMENT, 1024) AS ColumnDescription,
tables.TABLE_ROWS AS TableRowCount
FROM INFORMATION_SCHEMA.COLUMNS columns
INNER JOIN INFORMATION_SCHEMA.TABLES tables ON tables.TABLE_CATALOG = columns.TABLE_CATALOG AND tables.TABLE_SCHEMA = columns.TABLE_SCHEMA AND tables.TABLE_NAME = columns.TABLE_NAME
LEFT JOIN (
SELECT
rc.CONSTRAINT_SCHEMA,
rc.TABLE_NAME,
kcu.COLUMN_NAME,
GROUP_CONCAT(CONCAT_WS(''.'', kcu.REFERENCED_TABLE_SCHEMA, kcu.REFERENCED_TABLE_NAME, kcu.REFERENCED_COLUMN_NAME)) AS ForeignKeyQN
FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS rc
JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu ON
rc.CONSTRAINT_SCHEMA = kcu.CONSTRAINT_SCHEMA
AND rc.TABLE_NAME = kcu.TABLE_NAME
AND rc.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME
GROUP BY rc.CONSTRAINT_SCHEMA, rc.TABLE_NAME, kcu.COLUMN_NAME
) fk_ref ON fk_ref.CONSTRAINT_SCHEMA = columns.TABLE_SCHEMA AND fk_ref.TABLE_NAME = columns.TABLE_NAME AND fk_ref.COLUMN_NAME = columns.COLUMN_NAME
WHERE columns.TABLE_SCHEMA NOT IN ('sys', 'mysql', 'information_schema', 'performance_schema', 'mysql_innodb_cluster_metadata')
Collecting schema from Postgres
The following query is for collecting the data dictionary fields from Postgres. $DatabaseName
is the placeholder for specifying the SQL Database name.
Note that:
- The row count does work for partitioned table (equals to the sum of row count of the individual table partitions)
- The row count accuracy “depends on whether ANALYZE or VACUUM are run enough - where “enough” is defined by the level of write activity to your table”. See this StackOverflow answer
- The foreign key information is not collected (because I am not familiar enough with Postgres)
SELECT
schemas.nspname AS SchemaName,
tables.relname AS TableName,
columns.attname AS ColumnName,
ROW_NUMBER() OVER (PARTITION BY columns.attrelid ORDER BY columns.attnum) AS OrdinalPosition,
format_type(columns.atttypid, columns.atttypmod) AS DataType,
NULL AS ForeignKeyTo, -- Not implemented
left(obj_description(columns.attrelid, 'pg_class'), 1024) AS TableDescription,
left(col_description(columns.attrelid, columns.attnum), 1024) AS ColumnDescription,
tables.reltuples::bigint + COALESCE(tstat2.children_reltuples::bigint, 0) AS TableRowCount
FROM $DatabaseName.pg_catalog.pg_attribute columns
INNER JOIN (
SELECT oid, relname, relnamespace, reltuples
FROM $DatabaseName.pg_catalog.pg_class
WHERE relkind IN ('p', 'r', 'v', 'm') AND NOT relispartition
) tables ON tables.oid = columns.attrelid
INNER JOIN $DatabaseName.pg_catalog.pg_namespace schemas ON schemas.oid = tables.relnamespace
LEFT JOIN (
SELECT pg_inherits.inhparent AS oid, SUM(reltuples::bigint) AS children_reltuples
FROM $DatabaseName.pg_catalog.pg_inherits
INNER JOIN $DatabaseName.pg_catalog.pg_class children ON children.oid = pg_inherits.inhrelid
GROUP BY pg_inherits.inhparent
) tstat2 ON tstat2.oid = columns.attrelid
WHERE
columns.attnum > 0
AND NOT columns.attisdropped
AND schemas.nspname NOT IN ('pg_catalog', 'information_schema')
Run the queries against linked servers
As mentioned in the beginning, we have some tens of linked servers (or even double linked servers) and databases to crawl, therefore a declarative script is crucial to fetch all the schemas without repeating the bulky SQLs, this is the reason why $DatabaseName
is used throughout the standalone SQLs above.
Here we have hardcoded to support only 2 levels of linked server, this limitation came from the fact that we are using SQL Server 2016, and we cannot reliably split notations like JUMP1.JUMP2.JUMP3.APP_DB
because the STRING_SPLIT function does not guarantee order until SQL Server 2022 with <em>enable_ordinal</em> argument, and since in our environment max is second level so we went with 2. It should be trivial to duplicate relevant lines to support more levels.
We also used a table variable instead of directly writing into the data dictionary table to avoid locks and black out period.
DECLARE @SQLServerCols VARCHAR(MAX) = '...';
DECLARE @MySQLCols VARCHAR(MAX) = '...';
DECLARE @PostgresCols VARCHAR(MAX) = '...';
DECLARE @Targets TABLE(
SystemName VARCHAR(255) PRIMARY KEY,
ServerNameL1 VARCHAR(255),
ServerNameL2 VARCHAR(255),
DatabaseName VARCHAR(255),
ColsSQLTemplate VARCHAR(MAX)
);
INSERT INTO @Targets VALUES
-- Local databases
('Application 1', NULL, NULL, 'App1', @SQLServerCols),
-- Linked Servers
('Application 2', 'SVR_APP2_DB', NULL, 'App2', @SQLServerCols),
-- Double-Linked Server
('Application 3', 'SVR_WAREHOUSE', 'SVR_APP3_DB', 'App3', @SQLServerCols),
('Application 4', 'SVR_WAREHOUSE', 'SVR_APP4_DB', 'App4', @SQLServerCols),
('Application 5', 'SVR_JUMPHOST', 'SVR_APP5_DB', 'myreportdb', @PostgresCols),
('WordPress Website', 'SVR_JUMPHOST', 'SVR_WP_DB', 'def', @MySQLCols)
;
DECLARE @SystemName VARCHAR(255);
DECLARE @ServerNameL1 VARCHAR(255);
DECLARE @ServerNameL2 VARCHAR(255);
DECLARE @DatabaseName VARCHAR(255);
DECLARE @ColsSQLTemplate VARCHAR(MAX);
DECLARE @DataDictionary TABLE -- See DataDictionary table structure
(
[SystemName] VARCHAR(255) NOT NULL ,
[ServerName] VARCHAR(255) NOT NULL,
[DatabaseName] VARCHAR(255) NOT NULL,
[SchemaName] VARCHAR(255) NOT NULL,
[TableName] VARCHAR(255) NOT NULL,
[ColumnName] VARCHAR(255) NOT NULL,
[OrdinalPosition] INT NULL,
[DataType] VARCHAR(255) NULL,
[ForeignKeyTo] VARCHAR(255) NULL,
[TableDescription] NVARCHAR(1024) NULL,
[ColumnDescription] NVARCHAR(1024) NULL,
[TableRowCount] BIGINT NULL
)
DECLARE MyCursor CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY
FOR
SELECT * FROM @Targets;
OPEN MyCursor
FETCH NEXT FROM MyCursor INTO @SystemName, @ServerNameL1, @ServerNameL2, @DatabaseName, @ColsSQLTemplate
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @TSQL VARCHAR(MAX) = REPLACE(@ColsSQLTemplate, '$DatabaseName', @DatabaseName);
IF @ServerNameL2 IS NOT NULL
SET @TSQL = 'SELECT * FROM OPENQUERY(' + @ServerNameL2 + ',''' + REPLACE(@TSQL, '''', '''''') + ''')';
IF @ServerNameL1 IS NOT NULL
SET @TSQL = 'SELECT * FROM OPENQUERY(' + @ServerNameL1 + ',''' + REPLACE(@TSQL, '''', '''''') + ''')';
SET @TSQL = 'SELECT ''' + @SystemName + ''' AS SystemName, ''' + COALESCE(@ServerNameL2, @ServerNameL1, @@SERVERNAME) + ''' AS ServerName, ''' + @DatabaseName + ''' AS DatabaseName, *
FROM (' + @TSQL + ') t
ORDER BY SchemaName, TableName, OrdinalPosition';
RAISERROR ('Processing System "%s"', 0, 1, @SystemName) WITH NOWAIT;
INSERT INTO @DataDictionary
EXEC (@TSQL);
FETCH NEXT FROM MyCursor INTO @SystemName, @ServerNameL1, @ServerNameL2, @DatabaseName, @ColsSQLTemplate
END
CLOSE MyCursor
DEALLOCATE MyCursor
Writing the final result into DataDictionary
Here we will make use of SQL Server’s MERGE
statement, which allows us to simultaneously write newly found columns, update existing column definitions, and mark existing column as deleted, while simultaneously retaining the [ColumnDescriptionDD]
column we defined and populated manually.
MERGE
INTO DataDictionary AS dd
USING @DataDictionary AS ddnew
ON
dd.ServerName = ddnew.ServerName
AND dd.DatabaseName = ddnew.DatabaseName
AND dd.SchemaName = ddnew.SchemaName
AND dd.TableName = ddnew.TableName
AND dd.ColumnName = ddnew.ColumnName
WHEN MATCHED
THEN
UPDATE
SET OrdinalPosition = ddnew.OrdinalPosition,
DataType = ddnew.DataType,
ForeignKeyTo = ddnew.ForeignKeyTo,
TableDescription = ddnew.TableDescription,
ColumnDescription = ddnew.ColumnDescription,
TableRowCount = ddnew.TableRowCount,
LastSeenAt = NULL
WHEN NOT MATCHED
THEN
INSERT (SystemName, ServerName, DatabaseName, SchemaName, TableName, ColumnName, OrdinalPosition, DataType, ForeignKeyTo, TableDescription, ColumnDescription, TableRowCount)
VALUES (SystemName, ServerName, DatabaseName, SchemaName, TableName, ColumnName, OrdinalPosition, DataType, ForeignKeyTo, TableDescription, ColumnDescription, TableRowCount)
WHEN NOT MATCHED BY SOURCE
THEN
UPDATE
SET OrdinalPosition = NULL,
LastSeenAt = CURRENT_TIMESTAMP
Putting it altogether
Now that we have all the puzzle pieces, we will combine them into this gigantic SQL enclosed in a stored procedure below.
Before you run the script, make sure to:
- Inspect and update the
WHERE
clauses, adjust to your environment’s need - Change the
DOMAIN\
filter to match your domain’s name (or remove if you are not in a domain environment) - Update the
@Targets
variable to specify linked servers and database names you would like to crawl from
And you are good to go! You can then deploy the procedure and set up a scheduled task or SQL Agent Job to routinely pull the schemas for you.
CREATE PROCEDURE [dbo].[usp_CollectDataDictionary]
AS
SET NOCOUNT ON;
SET XACT_ABORT ON;
DECLARE @SQLServerCols VARCHAR(MAX) = '
SELECT
schemas.name AS SchemaName,
tables.name AS TableName,
columns.name AS ColumnName,
ROW_NUMBER() OVER (PARTITION BY columns.object_id ORDER BY columns.column_id) AS OrdinalPosition,
types.name + CASE
WHEN types.name IN (''binary'', ''char'', ''varbinary'', ''varchar'') THEN ''('' + CASE WHEN columns.max_length = -1 THEN ''MAX'' ELSE CAST(columns.max_length AS VARCHAR(255)) END + '')''
WHEN types.name IN (''nchar'', ''nvarchar'') THEN ''('' + CASE WHEN columns.max_length = -1 THEN ''MAX'' ELSE CAST(columns.max_length / 2 AS VARCHAR(255)) END + '')''
WHEN types.name IN (''decimal'', ''numeric'') THEN ''('' + CAST(columns.precision AS VARCHAR(255)) + '','' + CAST(columns.scale AS VARCHAR(255)) + '')''
WHEN types.name IN (''datetime2'', ''datetimeoffset'', ''time'') THEN ''('' + CAST(columns.scale AS VARCHAR(255)) + '')''
ELSE ''''
END AS DataType,
fk_ref.ReferencedColumnQN AS ForeignKeyTo,
table_desc.Description AS TableDescription,
col_desc.Description AS ColumnDescription,
tstat.TableRowCount
FROM $DatabaseName.sys.columns columns
INNER JOIN (
SELECT name, object_id, schema_id, is_ms_shipped FROM $DatabaseName.sys.tables
UNION ALL
SELECT name, object_id, schema_id, is_ms_shipped FROM $DatabaseName.sys.views
) tables ON tables.object_id = columns.object_id
INNER JOIN $DatabaseName.sys.schemas schemas ON schemas.schema_id = tables.schema_id
INNER JOIN $DatabaseName.sys.types types ON types.user_type_id = columns.user_type_id
LEFT JOIN ( -- Get table row count (LEFT JOIN because VIEWs do not have a row count)
-- Ref: https://stackoverflow.com/a/28917736
SELECT object_id, SUM([rows]) AS TableRowCount
FROM $DatabaseName.sys.partitions
WHERE index_id in (0,1)
GROUP BY object_id
) tstat ON tstat.object_id = tables.object_id
LEFT JOIN ( -- Get foreign key references (If one column has multiple FK, one will be chosen to display arbitrarily, however this should be rare in any sane DB design)
SELECT
parent_object_id,
parent_column_id,
ReferencedColumnQN
FROM (
SELECT
fkc.parent_object_id,
fkc.parent_column_id,
QUOTENAME(s_child.name) + ''.'' + QUOTENAME(t_child.name) + ''.'' + QUOTENAME(c_child.name) AS ReferencedColumnQN, -- Qualified Name
ROW_NUMBER() OVER (PARTITION BY fkc.parent_object_id, fkc.parent_column_id ORDER BY fkc.constraint_object_id) AS Priority -- ORDER BY is arbitrary
FROM $DatabaseName.sys.foreign_keys fk
INNER JOIN $DatabaseName.sys.foreign_key_columns fkc ON fkc.constraint_object_id = fk.object_id
INNER JOIN $DatabaseName.sys.tables t_child ON t_child.object_id = fkc.referenced_object_id
INNER JOIN $DatabaseName.sys.columns c_child ON c_child.object_id = t_child.object_id AND fkc.referenced_column_id = c_child.column_id
INNER JOIN $DatabaseName.sys.schemas s_child ON s_child.schema_id = t_child.schema_id
) t
WHERE Priority = 1
) fk_ref ON fk_ref.parent_object_id = columns.object_id AND fk_ref.parent_column_id = columns.column_id
LEFT JOIN (
-- Ref: https://learn.microsoft.com/en-us/sql/relational-databases/system-catalog-views/extended-properties-catalog-views-sys-extended-properties?view=sql-server-ver16
SELECT
major_id AS object_id,
CAST(value AS NVARCHAR(1024)) AS Description
FROM $DatabaseName.sys.extended_properties
WHERE class = 1 AND minor_id = 0 AND name = ''MS_Description''
) table_desc ON table_desc.object_id = columns.object_id
LEFT JOIN (
-- See table_desc reference
SELECT
major_id AS object_id,
minor_id AS column_id,
CAST(value AS NVARCHAR(1024)) AS Description
FROM $DatabaseName.sys.extended_properties
WHERE class = 1 AND name = ''MS_Description''
) col_desc ON col_desc.object_id = columns.object_id AND col_desc.column_id = columns.column_id
WHERE
tables.is_ms_shipped = 0
AND NOT (schemas.name LIKE ''DOMAIN\%'') -- Ignore personal tables
AND NOT (tables.name LIKE ''%_BAK_%'' OR tables.name LIKE ''%_TEST_%'') -- Intermediate tables
AND NOT (schemas.name = ''mdm'' AND (tables.name LIKE ''viw_%'' OR tables.name LIKE ''tbl_%'')) -- MDS internal tables
AND NOT (schemas.name = ''stg'' AND (tables.name LIKE ''viw_%'' OR tables.name LIKE ''%_Leaf'')) -- MDS internal tables
AND NOT (schemas.name = ''stg'' AND tables.name LIKE ''[[]stg].[[]viw_%'') -- MDS internal tables, name looks malformed but it exists
';
DECLARE @MySQLCols VARCHAR(MAX) = /* For MySQL, there is no catalog or database, please use 'def' as the DatabaseName in-line with MySQL's internal naming */ '
SELECT
columns.TABLE_SCHEMA AS SchemaName,
columns.TABLE_NAME AS TableName,
columns.COLUMN_NAME AS ColumnName,
columns.ORDINAL_POSITION AS OrdinalPosition,
columns.COLUMN_TYPE AS DataType, -- Can list unassigned for number and enum type
fk_ref.ForeignKeyQN AS ForeignKeyTo,
LEFT(tables.TABLE_COMMENT, 1024) AS TableDescription,
LEFT(columns.COLUMN_COMMENT, 1024) AS ColumnDescription,
tables.TABLE_ROWS AS TableRowCount -- "... InnoDB, this value is an approximation, and may vary from the actual value by as much as 40% to 50% ..." -- Ref: https://dev.mysql.com/doc/refman/8.0/en/information-schema-tables-table.html
FROM INFORMATION_SCHEMA.COLUMNS columns
INNER JOIN INFORMATION_SCHEMA.TABLES tables ON tables.TABLE_CATALOG = columns.TABLE_CATALOG AND tables.TABLE_SCHEMA = columns.TABLE_SCHEMA AND tables.TABLE_NAME = columns.TABLE_NAME
LEFT JOIN (
-- Ref: https://www.db-fiddle.com/f/sryFQ4D9fSwf9xPXh16S8S/0
SELECT
rc.CONSTRAINT_SCHEMA,
rc.TABLE_NAME,
kcu.COLUMN_NAME,
GROUP_CONCAT(CONCAT_WS(''.'', kcu.REFERENCED_TABLE_SCHEMA, kcu.REFERENCED_TABLE_NAME, kcu.REFERENCED_COLUMN_NAME)) AS ForeignKeyQN
FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS rc
JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu ON
rc.CONSTRAINT_SCHEMA = kcu.CONSTRAINT_SCHEMA
AND rc.TABLE_NAME = kcu.TABLE_NAME
AND rc.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME
GROUP BY rc.CONSTRAINT_SCHEMA, rc.TABLE_NAME, kcu.COLUMN_NAME
) fk_ref ON fk_ref.CONSTRAINT_SCHEMA = columns.TABLE_SCHEMA AND fk_ref.TABLE_NAME = columns.TABLE_NAME AND fk_ref.COLUMN_NAME = columns.COLUMN_NAME
WHERE columns.TABLE_SCHEMA NOT IN (''sys'', ''mysql'', ''information_schema'', ''performance_schema'', ''mysql_innodb_cluster_metadata'')
';
DECLARE @PostgresCols VARCHAR(MAX) = '
SELECT
schemas.nspname AS SchemaName,
tables.relname AS TableName,
columns.attname AS ColumnName,
ROW_NUMBER() OVER (PARTITION BY columns.attrelid ORDER BY columns.attnum) AS OrdinalPosition,
format_type(columns.atttypid, columns.atttypmod) AS DataType,
NULL AS ForeignKeyTo, -- Not implemented
left(obj_description(columns.attrelid, ''pg_class''), 1024) AS TableDescription,
left(col_description(columns.attrelid, columns.attnum), 1024) AS ColumnDescription,
tables.reltuples::bigint + COALESCE(tstat2.children_reltuples::bigint, 0) AS TableRowCount
FROM $DatabaseName.pg_catalog.pg_attribute columns
INNER JOIN (
SELECT oid, relname, relnamespace, reltuples
FROM $DatabaseName.pg_catalog.pg_class
WHERE relkind IN (''p'', ''r'', ''v'', ''m'') AND NOT relispartition
) tables ON tables.oid = columns.attrelid
INNER JOIN $DatabaseName.pg_catalog.pg_namespace schemas ON schemas.oid = tables.relnamespace
LEFT JOIN (
SELECT pg_inherits.inhparent AS oid, SUM(reltuples::bigint) AS children_reltuples
FROM $DatabaseName.pg_catalog.pg_inherits
INNER JOIN $DatabaseName.pg_catalog.pg_class children ON children.oid = pg_inherits.inhrelid
GROUP BY pg_inherits.inhparent
) tstat2 ON tstat2.oid = columns.attrelid
WHERE
columns.attnum > 0
AND NOT columns.attisdropped
AND schemas.nspname NOT IN (''pg_catalog'', ''information_schema'')
';
DECLARE @Targets TABLE(
SystemName VARCHAR(255) PRIMARY KEY,
ServerNameL1 VARCHAR(255),
ServerNameL2 VARCHAR(255),
DatabaseName VARCHAR(255),
ColsSQLTemplate VARCHAR(MAX)
);
INSERT INTO @Targets VALUES
-- Local databases
('Application 1', NULL, NULL, 'App1', @SQLServerCols),
-- Linked Servers
('Application 2', 'SVR_APP2_DB', NULL, 'App2', @SQLServerCols),
-- Double-Linked Server
('Application 3', 'SVR_WAREHOUSE', 'SVR_APP3_DB', 'App3', @SQLServerCols),
('Application 4', 'SVR_WAREHOUSE', 'SVR_APP4_DB', 'App4', @SQLServerCols),
('Application 5', 'SVR_JUMPHOST', 'SVR_APP5_DB', 'myreportdb', @PostgresCols),
('WordPress Website', 'SVR_JUMPHOST', 'SVR_WP_DB', 'def', @MySQLCols)
;
DECLARE @SystemName VARCHAR(255);
DECLARE @ServerNameL1 VARCHAR(255);
DECLARE @ServerNameL2 VARCHAR(255);
DECLARE @DatabaseName VARCHAR(255);
DECLARE @ColsSQLTemplate VARCHAR(MAX);
DECLARE @DataDictionary TABLE -- See DataDictionary table structure
(
[SystemName] VARCHAR(255) NOT NULL ,
[ServerName] VARCHAR(255) NOT NULL,
[DatabaseName] VARCHAR(255) NOT NULL,
[SchemaName] VARCHAR(255) NOT NULL,
[TableName] VARCHAR(255) NOT NULL,
[ColumnName] VARCHAR(255) NOT NULL,
[OrdinalPosition] INT NULL,
[DataType] VARCHAR(255) NULL,
[ForeignKeyTo] VARCHAR(255) NULL,
[TableDescription] NVARCHAR(1024) NULL,
[ColumnDescription] NVARCHAR(1024) NULL,
[TableRowCount] BIGINT NULL
)
DECLARE MyCursor CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY
FOR
SELECT * FROM @Targets;
OPEN MyCursor
FETCH NEXT FROM MyCursor INTO @SystemName, @ServerNameL1, @ServerNameL2, @DatabaseName, @ColsSQLTemplate
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @TSQL VARCHAR(MAX) = REPLACE(@ColsSQLTemplate, '$DatabaseName', @DatabaseName);
IF @ServerNameL2 IS NOT NULL
SET @TSQL = 'SELECT * FROM OPENQUERY(' + @ServerNameL2 + ',''' + REPLACE(@TSQL, '''', '''''') + ''')';
IF @ServerNameL1 IS NOT NULL
SET @TSQL = 'SELECT * FROM OPENQUERY(' + @ServerNameL1 + ',''' + REPLACE(@TSQL, '''', '''''') + ''')';
SET @TSQL = 'SELECT ''' + @SystemName + ''' AS SystemName, ''' + COALESCE(@ServerNameL2, @ServerNameL1, @@SERVERNAME) + ''' AS ServerName, ''' + @DatabaseName + ''' AS DatabaseName, *
FROM (' + @TSQL + ') t
ORDER BY SchemaName, TableName, OrdinalPosition';
RAISERROR ('Processing System "%s"', 0, 1, @SystemName) WITH NOWAIT;
INSERT INTO @DataDictionary
EXEC (@TSQL);
FETCH NEXT FROM MyCursor INTO @SystemName, @ServerNameL1, @ServerNameL2, @DatabaseName, @ColsSQLTemplate
END
CLOSE MyCursor
DEALLOCATE MyCursor
MERGE
INTO DataDictionary AS dd
USING @DataDictionary AS ddnew
ON
dd.ServerName = ddnew.ServerName
AND dd.DatabaseName = ddnew.DatabaseName
AND dd.SchemaName = ddnew.SchemaName
AND dd.TableName = ddnew.TableName
AND dd.ColumnName = ddnew.ColumnName
WHEN MATCHED
THEN
UPDATE
SET OrdinalPosition = ddnew.OrdinalPosition,
DataType = ddnew.DataType,
ForeignKeyTo = ddnew.ForeignKeyTo,
TableDescription = ddnew.TableDescription,
ColumnDescription = ddnew.ColumnDescription,
TableRowCount = ddnew.TableRowCount,
LastSeenAt = NULL
WHEN NOT MATCHED
THEN
INSERT (SystemName, ServerName, DatabaseName, SchemaName, TableName, ColumnName, OrdinalPosition, DataType, ForeignKeyTo, TableDescription, ColumnDescription, TableRowCount)
VALUES (SystemName, ServerName, DatabaseName, SchemaName, TableName, ColumnName, OrdinalPosition, DataType, ForeignKeyTo, TableDescription, ColumnDescription, TableRowCount)
WHEN NOT MATCHED BY SOURCE
THEN
UPDATE
SET OrdinalPosition = NULL,
LastSeenAt = CURRENT_TIMESTAMP
;
RETURN 0