2015-02-25

How to reduce the database size - Part 1

A large database is not bad per se. It means the system operates at scale, many sales orders rush into the system every day, your company is doing good business, and as a DBA you don't need to worry about losing your job. The problem, however, is that large databases are more expensive to support and maintain: they usually require powerful hardware to run, and storage costs are high.
Reducing the database size often helps to achieve other goals as well. Think about Disaster Recovery (DR): a smaller database is faster to restore. Also think about performance: smaller tables and rows mean less I/O to access the data.
In this series, I'm going to discuss different measures to reduce database file (mdf and ndf) size. Some of them are transparent to the client applications; others require code refactoring.

0. Check the free space in the data files
To make sure the large data file size is really due to the data inside, you should check the free space in the data files before anything else. This does not mean you should shrink the database, which would result in fragmentation in the data files (ref. Don't Touch that Shrink Database Button!). Again, this step is just to make sure you're dealing with a database that is really filled with data. The script below checks the amount of allocated and unallocated space on a per-file basis for the current database.
SELECT f.type_desc AS [Type]
    ,f.name AS [FileName]
    ,fg.name AS [FileGroup]
    ,f.physical_name AS [Path]
    ,f.size / 128.0 AS [CurrentSizeMB]
    ,f.size / 128.0 - CONVERT(INT, FILEPROPERTY(f.name, 'SpaceUsed')) / 128.0 AS [FreeSpaceMB]
FROM sys.database_files f WITH (NOLOCK)
    LEFT JOIN sys.filegroups fg WITH (NOLOCK) ON
        f.data_space_id = fg.data_space_id
OPTION (RECOMPILE)
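If the files do turn out to contain a lot of unallocated space and you have to hand it back to the OS, one option that avoids the fragmentation problem is DBCC SHRINKFILE with the TRUNCATEONLY option, which only releases the unused space at the end of the file and does not move any pages. A minimal sketch; the logical file name is hypothetical, substitute the [FileName] value returned by the script above:
-- TRUNCATEONLY releases unallocated space from the end of the file
-- without relocating any pages, so it doesn't introduce fragmentation.
-- 'MyDatabase_Data' is a placeholder logical file name.
DBCC SHRINKFILE (N'MyDatabase_Data', TRUNCATEONLY)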

1. Find out the most space-consuming objects
Tables (essentially clustered indexes or heaps) and indexes (non-clustered indexes) are the database objects that contain data. Other objects such as stored procedures, views, and functions don't consume much space. The script below helps you check the space usage on a per-object basis.
;WITH SpaceInfo(ObjectId, IndexId, TableName, IndexName
    ,Rows, TotalSpaceMB, UsedSpaceMB)
AS
(
    SELECT
        t.object_id AS [ObjectId]
        ,i.index_id AS [IndexId]
        ,s.name + '.' + t.name AS [TableName]
        ,i.name AS [IndexName]
        ,SUM(p.[rows]) AS [Rows]
        ,SUM(au.total_pages) * 8 / 1024 AS [TotalSpaceMB]
        ,SUM(au.used_pages) * 8 / 1024 AS [UsedSpaceMB]
    FROM
        sys.tables t WITH (NOLOCK)
        JOIN sys.schemas s WITH (NOLOCK) ON
            s.schema_id = t.schema_id
        JOIN sys.indexes i WITH (NOLOCK) ON
            t.object_id = i.object_id
        JOIN sys.partitions p WITH (NOLOCK) ON
            i.object_id = p.object_id AND
            i.index_id = p.index_id
        CROSS APPLY
        (
            -- total_pages counts all pages reserved for the object;
            -- used_pages counts only the pages actually in use
            SELECT
                SUM(a.total_pages) AS total_pages
                ,SUM(a.used_pages) AS used_pages
            FROM sys.allocation_units a WITH (NOLOCK)
            WHERE p.partition_id = a.container_id
        ) au
    WHERE
        i.object_id > 255 -- skip system objects
    GROUP BY
        t.object_id, i.index_id, s.name, t.name, i.name
)
SELECT
    ObjectId, IndexId, TableName, IndexName
    ,Rows, TotalSpaceMB, UsedSpaceMB
    ,TotalSpaceMB - UsedSpaceMB AS [UnusedSpaceMB]
FROM
    SpaceInfo
ORDER BY
    TotalSpaceMB DESC
OPTION (RECOMPILE)
The figure below shows the output of this script, which quickly pinpoints the most space-consuming objects.


2. Low page density (aka internal fragmentation) and index defragmentation
Typical day-to-day activity causes indexes to fragment over time. Changes to your data (inserts, updates, deletes, even changes to varchar values) all contribute to fragmentation, which comes in two kinds.
Logical fragmentation (aka external fragmentation) is caused by the page split operations SQL Server performs to accommodate new rows (inserts) and increased row lengths (updates). It means that the logical order of the pages, determined by the index key values, does not match their physical order on disk, and/or that logically subsequent pages are not located in the same or adjacent extents (an extent is a group of 8 pages).
Low page density (aka internal fragmentation) is empty space on the data file pages of an index, caused by page split operations, by record deletions, or by a record size that only lets a few records fit on a page (e.g. a 5KB record size, where only one record fits per 8KB page, leaving 3KB of empty, wasted space per page). This reduces data density: extra disk space is required to store the data, buffer pool memory is wasted, and more I/O is required to read the same amount of data.
A small degree of internal fragmentation (free space in index pages) is not necessarily bad: it reduces page splits during insert and update operations (the same effect as a fill factor). Nonetheless, a large degree of internal fragmentation wastes space and reduces the performance of the system. Moreover, for indexes with ever-increasing keys, e.g. IDENTITY columns, internal fragmentation is not desirable at all, because the data is always inserted at the end of the index and the free space is never reused.
You can monitor both internal and external fragmentation with the sys.dm_db_index_physical_stats function. Internal fragmentation is reflected in the avg_page_space_used_in_percent column: a lower value indicates a higher degree of internal fragmentation. The script below shows the page density of all indexes in the current database. It's suggested to run it on a non-production server.
/*
 * Check indexes' page density & logical fragmentation level.
 * Suggested to run on a NON-production environment to avoid extra system workload.
 */
SELECT
    OBJECT_NAME([object_id], database_id) AS [object],
    index_id, partition_number, index_level,
    index_type_desc, alloc_unit_type_desc,
    page_count,
    avg_page_space_used_in_percent AS [page density %],
    avg_fragmentation_in_percent AS [logical fragmentation %]
FROM sys.dm_db_index_physical_stats(DB_ID(), DEFAULT, DEFAULT, DEFAULT, 'DETAILED')
ORDER BY [object], index_id, partition_number, index_level DESC
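If you have to check page density on a busy production system, a lighter-weight alternative I'd suggest is the 'SAMPLED' scan mode, which still returns avg_page_space_used_in_percent (the cheap 'LIMITED' mode returns NULL for that column), combined with scanning one table at a time. A minimal sketch; dbo.Orders is a hypothetical table name, substitute your own:
-- SAMPLED mode reads a 1% sample of the pages for indexes larger than
-- 10,000 pages (smaller indexes are scanned fully), trading some accuracy
-- for a much lighter scan than DETAILED.
SELECT index_id, index_level, page_count,
    avg_page_space_used_in_percent AS [page density %],
    avg_fragmentation_in_percent AS [logical fragmentation %]
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.Orders'), DEFAULT, DEFAULT, 'SAMPLED')
ORDER BY index_id, index_level DESC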
The figure below illustrates partial output of the DETAILED script.

To better understand the result above, let me briefly explain table and index organization. Tables (essentially clustered indexes or heaps) and (non-clustered) indexes are organized the same way. The figure below shows the organization of a table. A table is contained in one or more partitions, and each partition contains data rows in either a heap or a clustered index structure (b-tree). The pages of the heap or clustered index are managed in one or more allocation units (IN_ROW_DATA, LOB_DATA, and ROW_OVERFLOW_DATA), depending on the column types in the data rows.

As the tables in my example database are not partitioned, all allocation units belong to partition 1 there. If you have any partitioned tables, you will see separate rows in the result: one per partition per allocation unit. The index_level column is the current level of the index: 0 for index leaf levels, heaps, and LOB_DATA or ROW_OVERFLOW_DATA allocation units; greater than 0 for non-leaf index levels. index_level is highest at the root level of an index.
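You can also see this organization directly in the catalog views. Here is a minimal sketch that lists the partitions and allocation units of a single table; again, dbo.Orders is a hypothetical name:
-- Each partition pairs with one or more allocation units: IN_ROW_DATA always,
-- plus LOB_DATA / ROW_OVERFLOW_DATA when the column types require them.
SELECT i.name AS [IndexName], p.partition_number, p.[rows],
    au.type_desc AS [AllocationUnit], au.total_pages, au.used_pages
FROM sys.partitions p WITH (NOLOCK)
    JOIN sys.allocation_units au WITH (NOLOCK) ON
        au.container_id = p.partition_id
    LEFT JOIN sys.indexes i WITH (NOLOCK) ON
        i.object_id = p.object_id AND i.index_id = p.index_id
WHERE p.object_id = OBJECT_ID(N'dbo.Orders')
ORDER BY p.index_id, p.partition_number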
You can remove internal fragmentation by rebuilding the index (ALTER INDEX ... REBUILD). Note that OFFLINE rebuild is the default (and I prefer OFFLINE to ONLINE, ref. Be careful when scheduling an online index rebuild). If possible, always try to define a maintenance window for the database; select critical tables that are fragmented and schedule offline index rebuilds for them. Generally you don't have to pay attention to heaps and small tables. Take a log backup before rebuilding indexes if your database is in the FULL recovery model.
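A minimal sketch of what such a maintenance-window job might look like (the index and table names are hypothetical):
-- Offline rebuild of a single fragmented index; rewriting the index pages
-- removes the internal fragmentation.
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders REBUILD;
-- Or rebuild all indexes on a table at once, if the window allows.
-- SORT_IN_TEMPDB and MAXDOP are optional knobs to move the sort off the
-- user database and to cap CPU usage.
ALTER INDEX ALL ON dbo.Orders
REBUILD WITH (SORT_IN_TEMPDB = ON, MAXDOP = 4);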
Defining an appropriate FILLFACTOR value can also reduce the number of page splits and, over time, the internal fragmentation they cause. Remember that FILLFACTOR is applied only during index creation or rebuild. Unfortunately, there is no universal FILLFACTOR value; as the DBA, you should try to figure out the optimal FILLFACTOR value for each index. (Ref. 5 Things About Fillfactor)
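For example, a hypothetical rebuild that reserves 10% free space on each leaf page would look like this:
-- FILLFACTOR = 90 leaves 10% free space per leaf page at rebuild time,
-- reducing future page splits for indexes with random inserts and updates.
-- The free space is NOT maintained afterwards; it only takes effect
-- at create/rebuild time.
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders
REBUILD WITH (FILLFACTOR = 90);
For ever-increasing keys such as IDENTITY columns, a FILLFACTOR of 100 is usually the right choice, since new rows always go to the end of the index and reserved free space would simply be wasted.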
