Using Netezza’s in-database analytics package FPGROWTH, database administrators can identify the most commonly used combination of tables and the performance of the queries that reference those sets of tables.
First, let’s see the most commonly used combination of tables. Today, FPGROWTH requires that you specify a unique ID within which associations are discovered. With query history data, the unique identifier is a composite key made up of NPSID, NPSINSTANCEID & OPID. We have to create a single key that we can feed into the procedure, which can be done as shown below:
1. Create a sequence that we can use to generate a unique key
create sequence seq_qhist_id as bigint;
2. Next, create a table that associates each unique NPSID, NPSINSTANCEID & OPID with a newly generated sequence value.
create table tbl_qhist_id_assoc
as
select *,
next value for seq_qhist_id qhist_id
from ( select distinct npsid,
npsinstanceid,
opid
from poc_qhist_db..”$hist_table_access_1″
) x;
3. Create a view that takes the data in $hist_table_access_1 and associates each record with the appropriate ID generated in step 2
create view vw_table_access_stats
as
select t.*,
a.qhist_id
from poc_qhist_db..”$hist_table_access_1″ t
inner join tbl_qhist_id_assoc a
using ( npsid, npsinstanceid, opid )
;
4. Prep work is done…. now we can run FPGROWTH against the view and find which tables are determined to be used most frequently within the same queries.bg
QHIST_ANALYSIS(ADMIN)=> call nza..fpgrowth(‘intable=vw_table_access_stats,tid=qhist_id,item=TABLEID,supportType=absolute,support=1000,pfx=qhist_analysis_0315′);
NOTICE:
RUNNING FPGrowth algorithm:
DATASET : “VW_TABLE_ACCESS_STATS”
Transaction column : “QHIST_ID”
Item column : “TABLEID”
Group by : <none>
Minimum support : 1000 transactions
Min frequent itemset size : 0
Max frequent itemset size : 1000000000
Level of conditional dbs : 1
Result tables prefix : “QHIST_ANALYSIS_0315_FP_”
FPGROWTH
———-
5843
(1 row)
We can see that this produced a number of “sets” with the command:
\dt qhist_analysis_0315
Our FPGROWTH command was instructed to produce a series of tables with the prefix qhist_analysis_0315; so \dt will display all tables with that string at the start of the object’s name.
QHIST_ANALYSIS(ADMIN)=> \dt qhist_analysis_0315
List of relations
Name | Type | Owner
——————————+——-+——-
QHIST_ANALYSIS_0315_FP_SET1 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET10 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET11 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET12 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET2 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET3 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET4 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET5 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET6 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET7 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET8 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET9 | TABLE | ADMIN
Let’s start with a chunk we can digest easily; sets of 5 objects.
QHIST_ANALYSIS(ADMIN)=> select * from QHIST_ANALYSIS_0315_FP_SET5 order by sup desc limit 10;
ITEM1 | ITEM2 | ITEM3 | ITEM4 | ITEM5 | SUP | GRP
———+———+———+———+———+——–+—–
1928132 | 1927319 | 1927115 | 1927921 | 1927709 | 103640 | 0
1927115 | 1928606 | 1927921 | 1927319 | 1927709 | 103635 | 0
1928132 | 1928606 | 1927115 | 1927921 | 1927709 | 103635 | 0
1928132 | 1928606 | 1927319 | 1927115 | 1927921 | 103635 | 0
1928132 | 1928606 | 1927319 | 1927115 | 1927709 | 103635 | 0
1928132 | 1928606 | 1927319 | 1927921 | 1927709 | 103635 | 0
5023 | 5006 | 5014 | 1260 | 5093 | 66411 | 0
5618 | 5014 | 1260 | 5093 | 5006 | 66371 | 0
5023 | 5618 | 5006 | 5014 | 1260 | 66331 | 0
5023 | 5618 | 5006 | 1260 | 5093 | 66331 | 0
We see some pretty dominant patterns of table usage in the first 6 rows of the result set. What we care about, however, is a large set of objects involved in queries that tend to be slower.
SELECT
setid,
avg(EXTRACT(epoch from finishtime-submittime))
FROM
(
SELECT
*
FROM
(
SELECT
b.setid,
npsid,
npsinstanceid,
opid
FROM
poc_qhist_db..”$hist_table_access_1″ a,
(
SELECT
‘set-’||item1 ||’.'||item2 ||’.'||item3||’.'||item4||’.'||item5 setid,
*
FROM
QHIST_ANALYSIS_0315_FP_SET5
) b
WHERE
a.tableid IN (b.item1, b.item2, b.item3, b.item4, b.item5)
AND tableid > 200000
GROUP BY
setid,
npsid,
npsinstanceid,
opid
HAVING
COUNT(1) = 5
) a
JOIN poc_qhist_db..”$hist_query_prolog_1″ p
USING(npsid, npsinstanceid, opid)
JOIN poc_qhist_db..”$hist_query_epilog_1″ e
USING(npsid, npsinstanceid, opid)
) a
GROUP BY
1
ORDER BY
2 DESC
This produces a result like:
SETID | AVG
———————————————+———-
set-1928132.1927319.1927115.1927921.1927709 | 7.195417
set-1927115.1928606.1927921.1927319.1927709 | 7.190187
set-1928132.1928606.1927319.1927921.1927709 | 7.190187
set-1928132.1928606.1927319.1927115.1927709 | 7.190187
set-1928132.1928606.1927115.1927921.1927709 | 7.190187
set-1928132.1928606.1927319.1927115.1927921 | 7.190187
set-2215266.1928606.2213789.2061762.2050344 | 1.659014
set-2215266.2050576.2213789.2061762.2050344 | 1.659014
set-2215266.2050576.1928606.2213789.2061762 | 1.659014
set-2213789.2050576.1928606.2050344.2061762 | 1.659014
set-2215266.2050576.1928606.2213789.2050344 | 1.659014
set-2215266.2050576.1928606.2061762.2050344 | 1.659014
set-972025.969171.969255.970648.968881 | 0.656696
Now, what is happening with this data is that there is one query that references 6 tables — but with each possible combination of 5 showed up several times. This is why you see the same averages for 5 different rows.
