
IBM Netezza analytics to analyze query history table usage


Using FPGROWTH from Netezza’s in-database analytics package, database administrators can identify the most commonly used combinations of tables and gauge the performance of the queries that reference those sets of tables.

First, let’s see the most commonly used combination of tables.  Today, FPGROWTH requires that you specify a unique ID within which associations are discovered.  With query history data, the unique identifier is a composite key made up of NPSID, NPSINSTANCEID & OPID.  We have to create a single key that we can feed into the procedure, which can be done as shown below:

1. Create a sequence that we can use to generate a unique key
create sequence seq_qhist_id as bigint;

2. Next, create a table that associates each unique NPSID, NPSINSTANCEID & OPID with a newly generated sequence value.

create table tbl_qhist_id_assoc
as
select  *,
        next value for seq_qhist_id qhist_id
from    ( select distinct npsid,
                          npsinstanceid,
                          opid
          from   poc_qhist_db.."$hist_table_access_1"
        ) x;

3. Create a view that takes the data in $hist_table_access_1 and associates each record with the appropriate ID generated in step 2

create view vw_table_access_stats
as
select     t.*,
           a.qhist_id
from       poc_qhist_db.."$hist_table_access_1" t
inner join tbl_qhist_id_assoc a
           using ( npsid, npsinstanceid, opid )
;

4. Prep work is done.  Now we can run FPGROWTH against the view and find which tables are used together most frequently within the same queries.

QHIST_ANALYSIS(ADMIN)=> call nza..fpgrowth('intable=vw_table_access_stats,tid=qhist_id,item=TABLEID,supportType=absolute,support=1000,pfx=qhist_analysis_0315');
NOTICE:
RUNNING FPGrowth algorithm:
DATASET : “VW_TABLE_ACCESS_STATS”
Transaction column : “QHIST_ID”
Item column : “TABLEID”
Group by : <none>
Minimum support : 1000  transactions
Min frequent itemset size : 0
Max frequent itemset size : 1000000000
Level of conditional dbs : 1
Result tables prefix : “QHIST_ANALYSIS_0315_FP_”

FPGROWTH
----------
5843
(1 row)

We can see that this produced a number of “sets” with the command:

\dt qhist_analysis_0315

Our FPGROWTH command was instructed to produce a series of tables with the prefix qhist_analysis_0315; so \dt will display all tables with that string at the start of the object’s name.

QHIST_ANALYSIS(ADMIN)=> \dt qhist_analysis_0315
List of relations
             Name             | Type  | Owner
------------------------------+-------+-------
QHIST_ANALYSIS_0315_FP_SET1  | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET10 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET11 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET12 | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET2  | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET3  | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET4  | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET5  | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET6  | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET7  | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET8  | TABLE | ADMIN
QHIST_ANALYSIS_0315_FP_SET9  | TABLE | ADMIN

Let’s start with a chunk we can digest easily; sets of 5 objects.

QHIST_ANALYSIS(ADMIN)=> select * from QHIST_ANALYSIS_0315_FP_SET5 order by sup desc limit 10;
ITEM1  |  ITEM2  |  ITEM3  |  ITEM4  |  ITEM5  |  SUP   | GRP
---------+---------+---------+---------+---------+--------+-----
1928132 | 1927319 | 1927115 | 1927921 | 1927709 | 103640 |   0
1927115 | 1928606 | 1927921 | 1927319 | 1927709 | 103635 |   0
1928132 | 1928606 | 1927115 | 1927921 | 1927709 | 103635 |   0
1928132 | 1928606 | 1927319 | 1927115 | 1927921 | 103635 |   0
1928132 | 1928606 | 1927319 | 1927115 | 1927709 | 103635 |   0
1928132 | 1928606 | 1927319 | 1927921 | 1927709 | 103635 |   0
5023 |    5006 |    5014 |    1260 |    5093 |  66411 |   0
5618 |    5014 |    1260 |    5093 |    5006 |  66371 |   0
5023 |    5618 |    5006 |    5014 |    1260 |  66331 |   0
5023 |    5618 |    5006 |    1260 |    5093 |  66331 |   0

We see some pretty dominant patterns of table usage in the first 6 rows of the result set.  What we care about, however, is which of these table sets show up in queries that tend to be slower.

SELECT setid,
       avg(EXTRACT(epoch from finishtime - submittime))
FROM   (
        SELECT *
        FROM   (
                SELECT b.setid,
                       npsid,
                       npsinstanceid,
                       opid
                FROM   poc_qhist_db.."$hist_table_access_1" a,
                       (
                        SELECT 'set-'||item1||'.'||item2||'.'||item3||'.'||item4||'.'||item5 setid,
                               *
                        FROM   QHIST_ANALYSIS_0315_FP_SET5
                       ) b
                WHERE  a.tableid IN (b.item1, b.item2, b.item3, b.item4, b.item5)
                AND    tableid > 200000
                GROUP BY setid, npsid, npsinstanceid, opid
                HAVING COUNT(1) = 5
               ) a
        JOIN   poc_qhist_db.."$hist_query_prolog_1" p USING (npsid, npsinstanceid, opid)
        JOIN   poc_qhist_db.."$hist_query_epilog_1" e USING (npsid, npsinstanceid, opid)
       ) a
GROUP BY 1
ORDER BY 2 DESC;

This produces a result like:

SETID                    |   AVG
---------------------------------------------+----------
set-1928132.1927319.1927115.1927921.1927709 | 7.195417
set-1927115.1928606.1927921.1927319.1927709 | 7.190187
set-1928132.1928606.1927319.1927921.1927709 | 7.190187
set-1928132.1928606.1927319.1927115.1927709 | 7.190187
set-1928132.1928606.1927115.1927921.1927709 | 7.190187
set-1928132.1928606.1927319.1927115.1927921 | 7.190187
set-2215266.1928606.2213789.2061762.2050344 | 1.659014
set-2215266.2050576.2213789.2061762.2050344 | 1.659014
set-2215266.2050576.1928606.2213789.2061762 | 1.659014
set-2213789.2050576.1928606.2050344.2061762 | 1.659014
set-2215266.2050576.1928606.2213789.2050344 | 1.659014
set-2215266.2050576.1928606.2061762.2050344 | 1.659014
set-972025.969171.969255.970648.968881      | 0.656696

What is happening with this data is that there is one query that references 6 tables, so every possible combination of 5 of those tables shows up as its own set.  This is why you see the same average repeated across several rows.
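To sanity-check that, here is a quick sketch against the same history table (reusing the tableid > 200000 filter from the query above) that counts how many distinct user tables each query touched:

select   npsid,
         npsinstanceid,
         opid,
         count(distinct tableid) tables_touched   -- queries touching exactly 6 tables explain the repeated averages
from     poc_qhist_db.."$hist_table_access_1"
where    tableid > 200000
group by npsid, npsinstanceid, opid
having   count(distinct tableid) = 6;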

 



Great read: “Predictive Analytics isn’t bad; bad marketing is bad”


Predictive analytics isn’t bad; bad marketing is bad

I came across this article and couldn’t help but recall my most recent experience with marketing that wasn’t just bad, but horrible.

Being a database guy and working in retail, nothing irks me more than when I get a marketing email that clearly has no science behind it.  Sure, the predictive piece is difficult but common sense isn’t.  For example: I recently bought a blazer at a well-respected retailer.  This was my very first purchase at this store and during the checkout process I provided my email address to get the receipt.  They now have me in their system!

A few days later I received my first bit of correspondence from them.

I’m not a huge fashion guy, but I was sure what they were promoting wouldn’t match my brown blazer.  Over the next few weeks, I received more and more emails, none of which resulted in a purchase and all of which were more examples of bad marketing.

What a wasted opportunity!  With very little effort and even less science, this retailer should have known that people who buy men’s blazers tend to be ... well, men.  And at the very least, their weekly marketing campaigns should be broken into two segments: men and women.

With the analytic functions Netezza offers, this particular retailer could have gotten a lot smarter and more precise by using FPGROWTH to identify the top items purchased by other customers who also bought a blazer.  They could go even further and include geography in that analysis, knowing that a shopper in Maryland behaves differently than a shopper in California.
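As a rough sketch only (the table and column names below are hypothetical, borrowed from the point-of-sale examples later in this blog, and the support threshold is arbitrary), the market-basket piece is a single call:

call nza..fpgrowth('intable=pos_txn_dtl, tid=customerid, item=productid, supportType=absolute, support=100, pfx=blazer_affinity');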

Until our email providers offer a filter that not only eliminates spam but also off-the-mark marketing messages, bad marketing is something we’ll have to live with.


Error 126 when connecting SPSS Modeler to Netezza using 64bit ODBC driver


I installed IBM SPSS Modeler 14.1 today and the first thing I wanted to do was connect it to Netezza and compare performance with and without in-database analytics.

It didn’t take long for me to run into my first issue, though not a serious one and one very easily resolved.  I attempted to set up a new database source using an existing Netezza ODBC connection.  The error string was:

Failed to connect to data source admin@TwinFin6

IM003[160] Specified driver could not be loaded due to system error 126: The specified module could not be found. (NetezzaSQL, C:\Windows\SysWOW64\nsqlodbc.dll).

Failed to connect to datasource: TwinFin6;UID=admin;PWD=*****

Interesting.  I knew I had installed the 64-bit ODBC driver, and I confirmed that nsqlodbc.dll was present only in C:\windows\system32 and not in SysWOW64.  I reran my Netezza ODBC installation and was careful to select the option to install both the 32- and 64-bit versions of the driver.  Only then did the driver’s DLL find its way into SysWOW64, allowing me to connect SPSS Modeler to Netezza.


Netezza in-database modeling with SPSS Modeler 14.2: K-Means 35x faster


Leveraging Netezza’s in-database analytic capabilities can significantly reduce the amount of time required to execute SPSS streams.  By pushing the analytics to the data, we eliminate the need to pull the data out of the table and onto our SPSS server where the execution takes place.  For this reason we’re constantly compromising on how much data to analyze knowing that much of the time spent is simply moving data across the network.

Using K-Means as an example, I ran a test against the Netezza provided census income demo data.  I folded it over a few times to balloon the table to 12.5M records.  This isn’t a huge number of individuals to want to cluster but enough to better illustrate the value of in-database modeling.

Client environment:  Windows 7 64bit, SPSS Data Modeler 14.2 FP2, 8GB of RAM and quad-core Intel i5 2.67 GHz CPU, connected via VPN

Database environment: IBM Netezza 1000-12, 6.0.5 P5

Table details: 12,769,472 unique individual records containing income & demographic information

First things first: for the SPSS K-Means model to work, we first have to read the data so that the columns are properly recognized and thus usable.

This step alone — reading the data — took 700 seconds!

Next, we add the K-means model to the palette and customize it to create 10 clusters using a maximum of 5 iterations.  This is done by adjusting the clusters and iterations sections on the model and expert tabs.

When ready, click run.  You’ll notice that once again we have to read through all 12.5M individual records before processing can begin.  All told, it took over 32 minutes to read through all of the records and segment them into 10 clusters using 5 iterations.

And now the Netezza in-database model.  First, we’ll review the fields to ensure that the ID is properly recognized and that all other fields in the table are inputs.  Please note that we don’t have to read the data first.  Since the data is in-database, SPSS doesn’t really need to understand what the fields are or how they will be used.

One difference between Netezza’s in-database K-Means and SPSS is that Netezza stores the results in a table.  For this reason you’ll need to specify a table name to store the resulting cluster summary.

Next we indicate that we’d like 10 clusters identified and a maximum of 5 iterations — just as we did with the SPSS version of K-Means.  Once done click run and watch the clock.

In the screenshot below we can see the entire process took 58 seconds to complete (59 if you round up).
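For reference, the in-database node is doing roughly what the stored procedure call below does when issued directly from a SQL client.  This is a sketch only: the table, ID column and output names are hypothetical, and the parameter style follows the nza..kmeans call shown later in this blog.

-- Sketch: 10 clusters, max 5 iterations, against a hypothetical census income table
call nza..kmeans('intable=census_income, id=person_id, k=10, maxiter=5, model=census_kmeans, outtable=census_kmeans_out');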


The relationship between groom and nzbackup


There are three basic functions that every Netezza DBA must perform regularly:

  1. Ensure statistics are up to date
  2. Groom your tables
  3. Backup your production databases

Let’s focus on groom and its dependency on nzbackup.  I recently ran into an issue where groom was running every day — but none of the deletes/updates were getting removed.

When any backup operation is run, a new entry is created in _t_backup_history recording the type of backup (0 = full, 1 = differential, 2 = cumulative, 4 = schema only, etc.), when it ran and for which database. Another key piece of information captured is the backup operation’s transaction id.
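A quick way to see what the system has recorded (a sketch using only the columns referenced in the script further down):

-- Sketch: most recent backup operation per database and backup type
select   dbname,
         type,
         max(opxid) last_backup_opxid
from     _t_backup_history
where    type in (0, 1, 2)
group by dbname, type;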

Quick piece of background: every record in a Netezza system has four hidden columns:

  1. createxid is the transaction id of the operation that created the record
  2. deletexid is the transaction id of the operation that deleted the record.
  3. datasliceid is the data slice (disk) that the record resides on
  4. rowid  is the unique row identifier (unique to the entire system)

Netezza logically deletes records by populating the deletexid column with the delete operation’s transaction id. This is an instruction to the FPGA to not allow these records past, eliminating visibility to them completely.
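You can see this bookkeeping for yourself; a sketch, where the table name is a placeholder and show_deleted_records is the same session variable used in the script below:

set show_deleted_records = 1;
-- rows whose deletexid has been set: logically deleted but still physically on disk
select count(1) as deleted_rows_still_on_disk
from   my_table
where  deletexid <> 0;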

So why is the last backup operation ID important?  This information is used by subsequent differential and cumulative backups to identify newly inserted records, deletes and updates. Records logically deleted since the last backup operation are recorded, ensuring a restore of that increment will result in those records properly being deleted.

Groom also uses this information to ensure that all logical deletes were recognized by a backup before it actually physically moves the row. Consider this example:

1. A table has 5 million rows in it when it gets backed up on Sunday night
2. A user deletes all of the rows (not truncate) on Monday morning
3. A groom operation is run

In this example none of the deletes will be physically removed and the table will not shrink.   To put it simply: a logically deleted record is eligible to be physically removed only if the deletexid is less than the last backup’s operation ID.  If it isn’t then groom will leave the record in place (by default).  You can override this behavior by adding ‘reclaim backupset none’  to your GROOM command.  This instructs groom to ignore any existing backup sets and physically remove any logical deletes.
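For example (the table name is a placeholder):

groom table my_table reclaim backupset none;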

PLEASE NOTE: doing so will mean the next differential backup requested will recognize that a groom was forced and perform a full backup for any tables impacted.

So how did this come up?  Well, it turns out a full backup had been run months back to test the backup script.  No differential or cumulative backups had been run after that, meaning the last recorded backup operation was months old.  As a result, none of the deletes and updates that had occurred since were eligible for grooming, and the tables slowly grew on disk until we discovered our mistake.

I wrote a script to help identify this situation for future use. It takes a single parameter: database name.  Like anything you find on the net, test it in your environment.

####################################################################
#!/bin/sh

set -- `getopt -a -u -l database: h $*`

while [[ $1 != "--" ]] ; do
     case $1 in
          "--database" ) DATABASE=$2 ; shift ;;
     esac
     shift
done

if [ -z "${DATABASE}" ] ; then
     echo "ERROR: You must specify a database to review with -database DATABASE."
     exit 1
fi

DB_OBJID=`/nz/support/contrib/bin/nz_get_database_objid ${DATABASE}`

if (( ${DB_OBJID:=0} == 0 )) ; then
     echo "ERROR: There is not a valid database with the name ${DATABASE}."
     exit 2
fi

DB_LAST_OPID=`nzsql -t -A -c "select max(opxid) from _t_backup_history where dbname = '$DATABASE' and type in ( 0,1,2)"`

if (( ${DB_LAST_OPID:=0} == 0 )) ; then
     echo "ERROR: Could not find any full, diff or cumulative backups for database ${DATABASE}."
     exit 3
fi

nzsql -t -A -c "select rpad(' tablename',50),rpad(' ungroomable',15), rpad(' visible rows',15), rpad(' total rows',15)"
nzsql -t -A -c "select rpad('-',50,'-'),rpad('-',15,'-'), rpad('-',15,'-'), rpad('-',15,'-')"

nzsql ${DATABASE} -c "\dt" -t -A | awk -F \| '{ print $1 }' | while read TABLENAME ; do
     nzsql ${DATABASE} -t -A  <<eof
          \o /dev/null
          set show_deleted_records = 1;
          \o
          select rpad('${TABLENAME}',50) ,
                 lpad(nvl(sum(case when deletexid > ${DB_LAST_OPID} then 1 else 0 end),0),15),
                 lpad(nvl(sum(case when deletexid = 0 then 1 else 0 end),0),15),
                 lpad(count(1),15)
          from   ${TABLENAME};
eof
done

exit 0
####################################################################

Using nz_zonemap to visualize Netezza’s zone map effectiveness


Netezza has a lot of tools in /nz/support/contrib/bin that make life for the NZ DBA much, MUCH easier.  One such tool is nz_zonemap.
Zone maps are how the system keeps track of what records exist in a particular extent (a 3MB unit of storage).  Each extent has the minimum and maximum value recorded for each of the columns for which zone map information exists.  As new data gets written to the table and new extents are allocated, more zone map information gets captured.

When users are querying the data, zone maps are used to immediately eliminate all of the extents that we KNOW don’t have the data we are interested in.  The more selective zone maps are (narrow min/max ranges) the more effective they become at eliminating disk I/O.  Generally speaking, larger tables should be organized on the one or two most commonly occurring restrictions.

To get a list of the columns that are currently zone mapped, run the script with only the database and table name as parameters.

[nz@netezza~]$ /nz/support/contrib/bin/nz_zonemap foo pos_txn_dtl

   Database: FOO
Object Name: POS_TXN_DTL
Object Type: TABLE
Object ID  : 1183666

The zonemappable columns are:

 Column # | Column Name | Data Type
----------+-------------+-----------
        1 | TXID        | BIGINT
        2 | PRODUCTID   | BIGINT
        3 | CUSTOMERID  | INTEGER
(3 rows)

The output here indicates that zone maps exist for the txid, productid and customerid columns.  We could then look at the minimum/maximum range on each extent for a given data slice.  By default, data slice 1 is used.  You could override this default by specifying a particular data slice with the option ‘-dsid NN’.  Using the customerid column in this example, the output looks like:

[nz@netezza~]$ /nz/support/contrib/bin/nz_zonemap foo pos_txn_dtl customerid

   Database: FOO
Object Name: POS_TXN_DTL
Object Type: TABLE
Object ID  : 1183666
 Data Slice: 1
   Column 1: CUSTOMERID  (INTEGER)

 Extent # | CUSTOMERID (Min) | CUSTOMERID (Max) | ORDER'ed
----------+------------------+------------------+----------
        1 | 3                | 199995           |
        2 | 3                | 199995           |
        3 | 3                | 199995           |
        4 | 3                | 199995           |
        5 | 3                | 199995           |
        6 | 3                | 199995           |
        7 | 3                | 199995           |
        8 | 3                | 199995           |
        9 | 3                | 199995           |
       10 | 3                | 199995           |
       11 | 3                | 199995           |
       12 | 3                | 199995           |
       13 | 3                | 199995           |
       14 | 3                | 199995           |
(14 rows)

This tells me that on data slice 1, any query restricting on customerid for some value or range of values will essentially have to perform a table scan, since the min/max range on each extent is so wide.  We can narrow this down in one of two ways: a clustered base table, or re-ordering the table with a CTAS statement.  Before we do either, I’ll get the elapsed time for a query against this table; it should be quick.

FOO(ADMIN)=> \time
Query time printout on
FOO(ADMIN)=> select count(1) from pos_txn_dtl where customerid = 100000;
 COUNT
-------
   800
(1 row)

Elapsed time: 0m0.468s

So it is taking us half a second to query this table (it is very small).  If this type of query ran thousands of times per day, then even a 1/10th of a second saving could add up.  Next, let’s organize the table on customerid and then groom it.

FOO(ADMIN)=> alter table pos_txn_dtl organize on (customerid);
ALTER TABLE
Elapsed time: 0m0.357s
FOO(ADMIN)=> groom table pos_txn_dtl;
NOTICE:  Groom processed 14792 pages; purged 0 records; scan size shrunk by 528 pages; table size shrunk by 48 extents.
GROOM ORGANIZE READY
Elapsed time: 0m9.764s

So our table is now organized on this column.  Doing so netted us another advantage: the table size shrunk!  The columnar compression techniques Netezza uses became more effective when this table’s sort order on disk changed.

Taking a look at the output of nz_zonemap now should show something very different.

[nz@netezza~]$ /nz/support/contrib/bin/nz_zonemap foo pos_txn_dtl customerid

   Database: FOO
Object Name: POS_TXN_DTL
Object Type: TABLE
Object ID  : 1183666
 Data Slice: 1
   Column 1: CUSTOMERID  (INTEGER)

 Extent # | CUSTOMERID (Min) | CUSTOMERID (Max) | ORDER'ed
----------+------------------+------------------+----------
        1 | 3                | 15531            |
        2 | 15542            | 30818            | TRUE
        3 | 30818            | 47035            | TRUE
        4 | 47035            | 62320            | TRUE
        5 | 62320            | 77965            | TRUE
        6 | 77965            | 93140            | TRUE
        7 | 93140            | 108599           | TRUE
        8 | 108599           | 123914           | TRUE
        9 | 123914           | 138584           | TRUE
       10 | 138584           | 155353           | TRUE
       11 | 155353           | 170317           | TRUE
       12 | 170317           | 185599           | TRUE
       13 | 185599           | 199995           | TRUE
(13 rows)

Now the min/max ranges are very, very narrow and the ORDERED indicator shows true — meaning the minimum value on this extent is greater than or equal to the maximum value of the previous extent.  Queries against this table restricting on customerid should now scan as little data as is possible to find the relevant records.

FOO(ADMIN)=> \time
Query time printout on
FOO(ADMIN)=> select count(1) from pos_txn_dtl where customerid = 100000;
 COUNT
-------
   800
(1 row)

Elapsed time: 0m0.054s

Our query went from .46 seconds to .05 seconds — or 9x faster.  This same approach can be used with much larger tables; the key here is that you are able to visualize the distribution of values across zone map entries using the nz_zonemap tool.
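The same reordering could also have been done with the CTAS approach mentioned earlier; a sketch, with a hypothetical new table name that you would then swap in for the original:

-- Sketch: rebuild the table in customerid order via CTAS
create table pos_txn_dtl_sorted as
select   *
from     pos_txn_dtl
order by customerid;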

Now, where to start your research into which tables to optimize is another story.  You could start with http://wp.me/p1U3hY-A to identify average query response time by table.


Dropping a Netezza Analytics model


The results of most of Netezza’s analytic functions are a series of tables, and the tables produced vary from function to function.  We schedule jobs to run the analytic functions and want to reuse the same model name so that our business intelligence tools can simply query the resulting tables.  Identifying the tables and dropping them manually is an option; so is using symbolic links and generating a new model every day.

Instead, Netezza provides a very easy way to clean up an existing model using its drop_model function.  It takes a single parameter: model.  For example:

FOO(ADMIN)=> call nza..arule('model=mbasket,
                              intable=pos_txn_dtl,
                              tid=customerid,
                              item=productid,
                              support=10,
                              supportType=absolute,
                              maxsetsize=3');
ERROR:  The view NZA_META_PRIV_MBASKET already exists. Choose another model name.

Here my analytic function failed because the model already existed.  To remove all of the results this previous execution created I run the following:

FOO(ADMIN)=> call nza..drop_model('model=mbasket');
NOTICE:  Dropped: MBASKET
 DROP_MODEL
------------
 t
(1 row)

Now I can rerun my analytic function without having to worry about existing tables.


Enabling Netezza Analytics for use in a database


Many of Netezza’s analytic functions require the presence of some metadata tables/views in order to work.  These tables/views are created when you initialize the analytic libraries in a particular database.

SYSTEM(ADMIN)=> call nza..kmeans('intable=t1,id=int1,k=10,model=kmeans,outtable=out,maxiter=5');
ERROR:  The metadata tables are not initialized. Please initialize: call nza..initialize();

Following that instruction generates another error:

SYSTEM(ADMIN)=> call nza..initialize();
NOTICE:  A "CREATE TABLE", "CREATE VIEW", or "GRANT" statement did not succeed.
You must enable your database for INZA first (run the script "create_inza_db.sh ").
For details, see the installation guide.

ERROR:  CREATE TABLE: permission denied.

In order to initialize the database properly, you have to call a shell script that comes with the analytics package.  That script is /nz/export/ae/utilities/bin/create_inza_db.sh.  It takes the database as a single parameter and creates several groups to manage privileges for that database.  Those groups are:

INZAUSERS — users who can execute analytic functions

INZADEVELOPERS — users who can create analytic functions

INZAADMINS — users who can create and manage other users’ analytic functions

In addition to this, the script will initialize the metadata tables properly.  The output of the script will look like:

[nz@netezza~]$ /nz/export/ae/utilities/bin/create_inza_db.sh foo
CREATE GROUP
CREATE GROUP
CREATE GROUP
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
REVOKE
ALTER USER
ALTER USER
 INITIALIZE
------------
 t
(1 row)

                     INITIALIZE
----------------------------------------------------
 The metadata objects are successfully initialized.
(1 row)
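Once the groups exist, letting someone run the in-database functions is just a matter of group membership; a sketch, with a hypothetical user name:

alter group INZAUSERS add user analyst1;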


Identifying idle transactions on a Netezza system


There are a couple of reasons why you might want to know about old(ish) transactions that are in an idle state.

  1. Any reference to an object between the begin/commit requires some type of lock on that object.  The lock is only cleared once the transaction completes.
  2. Netezza uses visibility lists to govern what records should be visible to each session.  A transaction idle for many days could prevent a user or process from doing something with a row that it might otherwise have access to.
  3. There is some impact on the DBA’s ability to administer the system.  For example: if I run the groom command, which physically removes the logical deletes from a table, it will not be able to permanently delete a record that was deleted in a transaction that occurred AFTER the idle transaction began.

More often than not, this situation occurs when a user explicitly begins a transaction and neglects to commit it.

So how do we identify these transactions?  One way is to run nzsession and manually review the state and initial connection time of each session.  nzsession will report the sessions we’re interested in as being tx-idle.  This simply means that a begin was issued but there is currently no active SQL statement running.  That helps, but I like to automate whenever possible, so having the underlying SQL is more useful.  The following could be run daily, hourly, whatever.  The SQL reports the session ID, connection time, user name, database, IP address and the last SQL command run for any session that is currently in a tx-idle state and connected more than two hours (7200 seconds) ago.

select   id,
         conntime,
         username,
         dbname,
         ipaddr,
         command
from     _v_session
where    status = 'tx-idle'
and      extract(epoch from current_timestamp) - extract(epoch from conntime) > 7200;

To change the age of the transactions reported by this query, change the value 7200 to the number of seconds appropriate for your requirement (e.g. 86400 for transactions a day old).
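Once an idle session has been identified, it can also be dealt with from the command line; a sketch, assuming the query returned session ID 12345 (the ID is hypothetical, and you should check nzsession -h for the abort options available on your release):

nzsession abort -id 12345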


Displaying plan files on Netezza V7.0


Netezza used to store query execution plans as ascii files in the /nz file system.  Active plans would be kept in /nz/data/plans and completed plans kept in /nz/kit/log/planshist/current/.  The problem with this was that anytime you wanted a plan file you would have to log onto the system to collect it.

 

Now there is a SQL interface to retrieve the actual execution plan.  Beginning with v7.0, you can type the following from your SQL client (where 78752 is the plan number):

show planfiles 78752;

This will dump the plan file to your screen.  If you are using nzsql, you can redirect the output to a file by first issuing:

\o /tmp/78752.pln

If you are using Aginity or another ODBC/JDBC/OLEDB SQL client, copy and paste the output to a file for further analysis.
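Putting the nzsql pieces together (the prompt and plan number are just the ones from the example above):

SYSTEM(ADMIN)=> \o /tmp/78752.pln
SYSTEM(ADMIN)=> show planfiles 78752;
SYSTEM(ADMIN)=> \o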


Configuring pig to work with a remote Hadoop cluster


1. First, download a stable release of Pig from the Apache Pig releases page.

2. As root (or some other privileged user), untar the pig tarball to /usr/local; this will create a sub-directory like /usr/local/pig.0.11.1.

3. Create a symbolic link (to make things easier)

ln -s /usr/local/pig.0.11.1 /usr/local/pig

4. Update your .bashrc or .profile to include:

export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin

5.  Contact your Hadoop administrator (or get it yourself if you have access) and create a tarball containing the necessary client files:

cd $HADOOP_HOME
tar -czvf client.tar.z core-site.xml hadoop-env.sh hdfs-site.xml log4j.properties mapred-site.xml ssl-client.xml.example

6.  Now, create a new directory (either in your home directory or somewhere else if others are going to need access to it):

mkdir hadoop.conf
cd hadoop.conf
tar -zxvf ../client.tar.z

7.  Now update your .profile or .bashrc to include this line:

export HADOOP_CONF_DIR=$HOME/hadoop.conf

8.  If it isn’t already, export JAVA_HOME in your .profile or .bashrc:

export JAVA_HOME=/usr/local/jdk1.7.0_17

9.  Run pig in interactive mode (but mapreduce execution):

pig -x mapreduce
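At the grunt prompt, a quick sanity check that Pig is really talking to the remote cluster is to list a directory on HDFS (the path here is just the one used elsewhere in this post):

grunt> fs -ls /user/hduser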

10.  Test it all out with an actual pig script.  Copy and paste the following into wordcount.pig:

documents = LOAD '/user/hduser/foo_input/*.txt' as (line:chararray);
words = foreach documents generate flatten(TOKENIZE(line)) as word;
grpd = group words by word;
cntd = foreach grpd generate group, COUNT(words);
dump cntd;

Change the LOAD path to a directory on your HDFS that actually has a bunch of text documents.

Run it:

pig -x mapreduce wordcount.pig

If everything is setup correctly, you’ll get a listing of words encountered and the number of times they were encountered.


Could not infer the matching function for org.apache.pig.builtin.SUM (or any function for that matter)


Pig – the language – may be like Pig – the animal – when it comes to ingesting data (not very picky), but syntax certainly does matter.  I learned this tonight while experimenting with Pig.  My script was pretty simple:

1. Load some data

2. Filter that data

3. Group that data

4. Aggregate that data

5. Sort that data

6. Limit the data to the top 10

7. Dump the data

What I didn’t realize is that which relation you reference inside an aggregate matters a great deal when it comes to inferring the appropriate data type.  For example, this will fail with the error found in the post heading:

mydata = load '/user/hduser/foo.dat' using PigStorage(',') as (person:chararray,val:int);
mydata_filtered = filter mydata by val > 1;
group_mydata = group mydata_filtered by ( person );
sumval = foreach group_mydata generate group, SUM(mydata.val);
dump sumval;

The reason is that the relation used to populate group_mydata is mydata_filtered, not mydata, so SUM(mydata.val) references a relation that is not part of the grouping.  This was a pretty sloppy mistake that I made when adding additional logic to filter the data AFTER the original script was written.  So if you get this error:

Could not infer the matching function for org.apache.pig.builtin.SUM

Confirm you are aggregating the column from the appropriate Pig relation and not an earlier created one (though it may be the parent!).
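For completeness, the corrected version of the script above simply aggregates the relation that was grouped:

mydata = load '/user/hduser/foo.dat' using PigStorage(',') as (person:chararray,val:int);
mydata_filtered = filter mydata by val > 1;
group_mydata = group mydata_filtered by ( person );
sumval = foreach group_mydata generate group, SUM(mydata_filtered.val);
dump sumval;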


Passing parameters to Pig scripts


The Pig scripting language Pig Latin allows for parameter substitution at run-time.  Like any script, the ability to define parameters makes it far easier to share code with other users.  To do this in Pig Latin, you simply modify your script as shown below:

ALL_PLAYER_STATS = load '/user/hduser/baseball/batting/*.csv' using PigStorage (',') as ( playerID:chararray, teamID:chararray, yearID:int);
FILTERED_TEAM = filter ALL_PLAYER_STATS by teamID == '$TEAMID';
dump FILTERED_TEAM;

Then, when you want to execute your script you specify the value like this:

pig -x mapreduce -p TEAMID=BOS batting.pig

Failure to specify the parameter at run-time will throw the following error:

ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Undefined parameter : TEAMID

You can specify as many parameters as you like with the -p option.  For example:

ALL_PLAYER_STATS = load '/user/hduser/baseball/batting/*csv' using PigStorage (',') as
    ( playerID:chararray, teamID:chararray, yearID:int);
FILTERED_TEAM = filter ALL_PLAYER_STATS by teamID == '$TEAMID' and yearID >= $MIN_YEAR; 
dump FILTERED_TEAM;

Notice that because yearID is defined as int in my schema, I dropped the single quotes around $MIN_YEAR.  Failing to do so will cause Pig to treat the value as a string, which will not match the type defined in the schema.  The error you’ll see is:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1039: 
<file test.pig, line 2, column 50> In alias FILTERED_TEAM, incompatible 
types in GreaterThanEqual Operator left hand side:int right hand side:chararray

To execute the multi-parameter script:

pig -x mapreduce -p TEAMID=NYY -p MIN_YEAR=2001 batting.pig
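If the list of parameters gets long, Pig can also read them from a file passed with -param_file; a sketch, where the file name is hypothetical:

# params.txt
TEAMID=NYY
MIN_YEAR=2001

pig -x mapreduce -param_file params.txt batting.pig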


Passing parameters to Hive scripts


Like Pig and other scripting languages, Hive provides you with the ability to create parameterized scripts – greatly increasing the re-usability of the scripts.  To take advantage, write your Hive scripts like this:

select yearid, sum(HR)
from   batting_stats
where  teamid = '${hiveconf:TEAMID}' 
group  by yearid
order  by yearid desc;

Note that the restriction on teamid is ‘${hiveconf:TEAMID}’ rather than an actual value.  This is an instruction to read this variable’s value from the hiveconf namespace.  When you execute the script, you’ll run it as shown below:

hive -f batting.hive -hiveconf TEAMID='LAA'

If you define the parameter in the script but fail to specify a value at run-time, you won’t get an error like you would with Pig.  Instead, the restriction effectively becomes "where teamid = ''".  If you have blank values then you might get a result back; if not, you’ll go through all the mechanics of executing the script without getting any results.
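You can pass more than one variable the same way; a sketch extending the script above with a hypothetical MIN_YEARID variable, left unquoted because yearid is numeric:

select yearid, sum(HR)
from   batting_stats
where  teamid = '${hiveconf:TEAMID}'
and    yearid >= ${hiveconf:MIN_YEARID}
group  by yearid
order  by yearid desc;

hive -f batting.hive -hiveconf TEAMID='LAA' -hiveconf MIN_YEARID=2001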


Hive’s collection data types


Hive offers several collection data types: struct, map and array.  These data types don’t necessarily make a lot of sense if you are moving data from the well-structured world of the RDBMS, but if you are working directly with application-generated or less-structured data, then this could be a great capability to have in your arsenal.

struct, like in most programming languages, allows you to define a structure with established columns and data types. For example, a column could be called address and be declared as:

address struct<street:string, city:string, state:string, zipcode:int>

When referring to these columns, you would reference it like address.street.

map is a little less structured; instead of predefining the sub-attributes of the column, you declare one data type for the key and one for the value. For example, an acceptable map could be:

preferences map<string, string>

This gives you the flexibility to add whatever keys you want – so long as each key is the right data type and each value matches its declared type also.

select preferences["email_offers"] from dim_customer;

Finally, array allows you to store any number of values of the same data type – and, functionally speaking, the same type of business object, too.  In other words, you wouldn’t use an array unless the elements represented the same type of information and used the same data type.  An example where an array could be used:

household_ages array<smallint>

You can put all this together into a single example to see how one might use these types, again assuming the data already arrives in this shape.  You probably wouldn’t convert existing structured data into this type of format.

create table dim_customer
 (
     customer_id         bigint,
     customer_name    struct<fname:string, lname:string>,
     customer_addr    struct<street:string, city:string, state:string, zip:int>,
     household_ages    array<smallint>,
     email_prefs            map<string, boolean>
 )
 row format delimited 
 fields terminated by '|'     -- This is how each field is separated
 collection items terminated by ','   -- This is how values in the struct, map and array are separated
 map keys terminated by ':'  -- This is how the keys in the map data type are separated from their values
 lines terminated by '\n' stored as textfile; 

Your input data – using the delimiters above – would then look like this:

12345|John,Smith|123 Main St,New York,NY,00000|45,40,17,13|weekly_update:true,special_clearance:true,birthday_greeting:false

And could be loaded with:

load data local inpath '/tmp/dim_customer.dat' overwrite into table dim_customer;
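Once loaded, each collection type has its own access syntax; for example, against the table above:

select customer_name.fname,
       customer_addr.city,
       household_ages[0],
       email_prefs['weekly_update']
from   dim_customer;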


To copy or move: Implications of loading Hive managed table from HDFS versus local filesystem


When using the load function to populate a Hive table, it’s important to understand what Hive does with the actual data files when the input data resides on your local file system or on the HDFS file system.

For example, to load data from your local home directory into a Hive table:

hive> LOAD DATA LOCAL INPATH '/home/hduser/weather_data/input' INTO TABLE weather_data;

You’ll actually see the copy happening in the output messages:

Copying data from file:/home/hduser/weather_data/input
Copying file: file:/home/hduser/weather_data/input/weather.16.csv
Copying file: file:/home/hduser/weather_data/input/weather.86.csv
...
...
Copying file: file:/home/hduser/weather_data/input/weather.52.csv
Copying file: file:/home/hduser/weather_data/input/weather.37.csv
Loading data to table default.weather_data

Under the covers, Hive will actually copy the files found in /home/hduser/weather_data/input into the HDFS directory associated with the table weather_data (e.g. /user/hive/warehouse/weather_data/). If you want to see what that directory is, run the following hive command:

hive> describe extended weather_data;

Look for the ‘location’ value.
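You can then list that warehouse directory to confirm the files were copied there (the path is the example location mentioned above):

hadoop dfs -ls /user/hive/warehouse/weather_data/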

If that data was already on the HDFS file system, however, Hive would employ a move and not a copy.  For example:

hduser@hadoop1:/home/hduser/$ hadoop dfs -ls /user/hduser/weather_data/ | wc -l
101
hive> load data inpath '/user/hduser/weather_data/' into table weather_data;

Now, let’s check the output of dfs -ls | wc -l

hduser@hadoop1:~/weather_data$ hadoop dfs -ls weather_data | wc -l
0

As you can see, the files were physically moved from /user/hduser/weather_data into the location associated with the Hive table.


Pig workflow optimization: splitting data flows


Pig supports the concept of non-linear data flows, where you have a single input but multiple outputs.  Pig’s optimizer is smart enough to recognize when the same input is referenced multiple times and implicitly splits those data flows.  You can also split explicitly with the split operator, as shown further below.  Personally, I prefer the explicit approach because it seems slightly easier to maintain.

An example of the optimizer implicitly splitting the flow is creating multiple Pig relations from the same input using different criteria and the filter function.

state_info = load '/user/hduser/geography/*.csv' using PigStorage(',') as ( stateID:chararray, population:int, timezone:chararray);
pst_states = filter state_info by timezone == 'PST';
mst_states = filter state_info by timezone == 'MST';
cst_states = filter state_info by timezone == 'CST';
est_states = filter state_info by timezone == 'EST';

The explicit approach is to use the split function.  That would look like this:

state_info = load '/user/hduser/geography/*.csv' using PigStorage(',') as ( stateID:chararray, population:int, timezone:chararray);
split state_info into
     mst_states if timezone == 'MST',
     pst_states if timezone == 'PST',
     cst_states if timezone == 'CST',
     est_states if timezone == 'EST';
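Either way, nothing actually runs until each branch feeds a store or dump; a sketch of materializing one of the branches, with a hypothetical output path:

store pst_states into '/user/hduser/geography/pst_states' using PigStorage(',');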

 

 


Securing (and sharing) password information in Sqoop jobs


Sqoop is a utility that allows you to move data from a relational database system to an HDFS file system (or export from Hadoop to RDBMS!).  One of the things to keep in mind as you start building Sqoop jobs is that the password information shouldn’t be passed via the command line.

Sqoop has a couple of ways to secure this information, one of which is creating a more secure options file that you pass to Sqoop at runtime.  For example:

1. Create a file (pg.parms in this example) containing the connection information in your UNIX/Linux home directory, with each option and its value on its own line:

--connect
jdbc:postgresql://mypostgres.server.com:5432/mydatabase
--username
hduser
--password
password

2.  Secure that file by changing the permissions to owner read-only

chmod 400 pg.parms

3.  Modify the appropriate Sqoop jobs to use this file

sqoop import --table mytable --options-file pg.parms

Another way to secure this information is with a password file stored on the HDFS file system itself; writing that one up next.


“Error occurred while loading translation library” when connecting R to IBM Netezza


When connecting my R-2.15 client to IBM Netezza v7 (NZA 2.5.4) for the first time, I got the error above.  Here was the connect call:

>nzConnectDSN("NZSQL")
Error in odbcDriverConnect("DSN=VirtualNZ") : 
  (converted from warning) [RODBC] ERROR: state HY000, code 45, message Error occurred while loading translation library

Check to make sure you are running the 64-bit version of the R client.  Once I switched to that version, the error went away.  It might be that installing the 32-bit Netezza ODBC driver resolves this also; in my case, I wasn’t interested in getting the 32-bit client working.
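A quick way to confirm which flavor of R you are running before chasing the driver (8 means 64-bit pointers, 4 means 32-bit):

> .Machine$sizeof.pointer
[1] 8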


Using R to improve your fantasy football team


So I’ve started playing around with R and this week decided to see if I could more intelligently add a player to my team from the ranks of free agency.  The position I needed to fill?  The kicker.

The first thing I did was to grab the YTD data for kickers and store it into a space delimited text file in the /tmp directory called kickers:

sgost 12 6 14 14 7 14 10 10 13 
nfolk 12 4 10 9 13 6 13 13 16
mprat 7 13 17 12 18 5 10 9 4
mcros 4 8 13 20 15 6 15 8 5
dbail 13 14 6 2 12 5 17 11 5
shaus 7 9 9 13 16 13 2 9 19
rsucc 4 5 13 9 15 6 5 12 12
avina 3 7 10 14 10 13 16 6 0
dcarp 3 12 13 11  8 11 4 7 4
nnova 4 18 5 13 4 15 6 6 8
ahenr 8 12 3 7 20 2 1 7 9
gcano 1 11 10 8 5 14 7 12 5
mnuge 3 9 2 7 9 10 11 10 4
pdaws 9 3 1 3 10 13 8 6 12
cstur 13 8 12 5 11 2 5 8 8

Next, I opened R.  I am running this on Linux – so I simply typed ‘R’ to launch the client.

First things first, we need to load the data into a list.

kickers <- read.table('/tmp/kickers', col.names=c('kicker','week1','week2','week3','week4','week5','week6','week7','week8','week9'))

This produces a data frame with the aforementioned column names.  But this doesn’t really give us any idea as to how consistent a kicker is, or how their averages may have been skewed by weeks where they significantly outperformed their typical performance.  I decided to use a boxplot to visualize the data for this purpose.

To do that, I needed to transpose the data so that each kicker had their own column with YTD results.  R has the ‘t’ function for this very purpose.

tkicker = t(kickers[,-1])

This creates a matrix with all of the data values but excludes the first column, which contained the kicker’s name; we have plans for that:

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
week1   12   12    7    4   13    7    4    3    3     4     8     1     3
week2    6    4   13    8   14    9    5    7   12    18    12    11     9
week3   14   10   17   13    6    9   13   10   13     5     3    10     2
week4   14    9   12   20    2   13    9   14   11    13     7     8     7
week5    7   13   18   15   12   16   15   10    8     4    20     5     9
week6   14    6    5    6    5   13    6   13   11    15     2    14    10
week7   10   13   10   15   17    2    5   16    4     6     1     7    11
week8   10   13    9    8   11    9   12    6    7     6     7    12    10
week9   13   16    4    5    5   19   12    0    4     8     9     5     4
      [,14] [,15]
week1     9    13
week2     3     8
week3     1    12
week4     3     5
week5    10    11
week6    13     2
week7     8     5
week8     6     8
week9    12     8

Next, let’s add the kicker’s name as a column header:

colnames(tkicker) <- kickers$kicker

Now we have something we can boxplot:

boxplot(tkicker)

This produces the following graphic:

[Screenshot: boxplot of each kicker’s weekly scores]

Good – but it doesn’t stand out.  Let’s add some color to this graphic:

boxplot(tkicker, col=colors())

[Screenshot: the same boxplot with colors added]

From this, we can see Stephen Gostkowski is by far the most consistent kicker, with a narrow IQR and a higher median than most.  Unfortunately, if your league is like my league, he isn’t available.  From the list, however, I could see Nick Folk was a good choice.  His median is also quite high, though some outlying performances brought that value up; his IQR was still relatively narrow, and he did have some poor performances against the Steelers and Patriots.  I picked him up based on this visual; I’ll let you know how I made out next week.
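To back the picture up with numbers, the same statistics the boxplot draws can be pulled straight from the transposed matrix:

# median and spread (IQR) per kicker, most consistent kickers have the smallest IQR
sort(apply(tkicker, 2, median), decreasing = TRUE)
sort(apply(tkicker, 2, IQR))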

