[Mod_log_sql] Re: Table indexing

Matt Erbst merbst at cox.net
Sun Jun 12 01:10:57 EST 2005


Well the reason for the explicit locks is to make sure no records are 
deleted from the original table which have not been inserted into the 
archive. (on a high traffic server, 100s of records could be written in 
the time it takes to insert all those records into a new table)

INSERT IGNORE would succeed at preventing duplicates, although only by 
checking the primary keys of each record prior to being inserted (which 
may or may not cause a significant performance problem, depending upon 
numerous factors).  However, one would still need to execute a DELETE 
statement on the original table(s) at some point to keep it to only a 
month's worth of data.  Unless this delete is executed in the same 
transaction batch as the command populating the archive tables, there is 
a possibility that (many) newly created records will be irreversibly 
deleted.  This is the precise scenario for which transactional databases 
were developed.

You have implicitly brought up a good point that it may make more sense 
to have your archive tables in a seperate database entirely.  My example 
was for the scenario where the user only has rights to 1 database on the 
server, although I now realise this is extremely unlikely among those 
who have installed mod_log_sql.

I would like to draw attention to the idea of offering support (possibly 
as a compile-time option) for using the database's mechanism for 
auto-incrementing big integers to provide a unique identifier for each 
hit rather than apache's mod_unique_id module.  The main challenges as I 
see it are:

    * providing an alternate schema to support this...
          o The BIGINT type is part of the SQL standard, and is thus
            supported by all reasonable DBMSes on all resonable
            platforms.  However, there are differences in the
            implementation.  For example, the optimal syntax in a create
            table statement for MySQL would be:
                + UNSIGNED BIGINT AUTO_INCREMENT PRIMARY KEY,
          o While the syntax to create a table in PGSQL is:
            (unfortunately PGSQL doesn't seem to support unsigned big
            integers [I am not a PGSQL expert])
                + BIGSERIAL PRIMARY KEY,
          o References:
                + http://www.postgresql.org/docs/8.0/interactive/datatype.html#DATATYPE-NUMERIC
                + http://www.postgresql.org/docs/7.4/interactive/datatype.html#DATATYPE-NUMERIC
                + http://dev.mysql.com/doc/mysql/en/numeric-types.html
    * modifying the INSERT INTO cookies, INSERT INTO headers_out, INSERT
      INTO headers_in, INSERT INTO notes statements to use the
      database's mechanism for each database.  This could be really
      challenging except for the convenient fact that Perl's DBI library
      provides the following function:
          o http://search.cpan.org/~timb/DBI/DBI.pm#last_insert_id

I would love to try to submit a patch for this if I had more experience 
with C, or felt more confident with doing this cleanly.  No promises 
though, and since my needs (a unique ID without installing 
mod_unique_id) are satisfied by the simple workaround described in the 
previous email, the changes should be done by somebody with the 
experience to do so cleanly, or not at all.

Maybe I'll package some scripts for creating and populating well-indexed 
archive tables, and some sample queries for use with www.phpmyadmin.org 
(or http://phppgadmin.org ) but if somebody else does it first, thank you!

Thanks,
Matt

Carl Edwards wrote:

>Hi Matt,
>
>Thanks for the info on indexes and archiving.  Instead of using
>explicit locks and emptying the whole table how about keeping a
>months worth of data in the current table and using something like:
>
>INSERT IGNORE INTO archive.access SELECT * from current.access;
>
>To keep the archive up to date?
>
>Thanks,
>Carl at sdrawdE.net
>
>  
>
>> -----Original Message-----
>> 
>> Today's Topics:
>> 
>>    1. Re: Table indexing (Matt Erbst)
>> 
>> 
>> ----------------------------------------------------------------------
>> 
>> Date: Fri, 03 Jun 2005 07:47:22 -0700
>> From: Matt Erbst <merbst at cox.net>
>> Subject: Re: [Mod_log_sql] Table indexing
>> 
>> When it comes to indexing, the answer is of course, "it 
>> depends".  From 
>> the perspective of a database engineer what we have here is an OLTP 
>> (On-Line Transaction Processing) system which we also wish 
>> to use as an 
>> OLAP (On-Line Analytical Processing) system.
>> 
>> mod_log_sql is recording Transactions (apache hits) in an 
>> On-Line manner 
>> (as they happen) to the database.  On heavily loaded sites 
>> it is writing 
>> upwards of 100 hits per second, 24H a day, every day.  If 
>> the option is 
>> enabled, mod_log_sql will be writing up to 20 cookies PER HIT to the 
>> cookies table, and multiple headers per hit to each of the 
>> header_in and 
>> header_out tables.  In other words the ratio of INSERTs to 
>> SELECTs will 
>> be HUGE, so any indexes will overall cost more time than 
>> they save.  If 
>> you wish to use mod_log_sql on a high-traffic server AND run 
>> queries on 
>> the data, it would be best to keep a mirror of these tables, with a 
>> script that runs once every day to move the contents over 
>> from the first 
>> table to the second set.
>> 
>> The set of tables being written to by mod_log_sql keeps the original 
>> names, but the copy can be named "access_log_archive" and 
>> should contain 
>> indexes.
>> 
>>     |-- If you are using MyISAM tables, or HEAP tables use 
>> the following
>>     line|| to create your archive tables (and clear out the 
>> originals)||:|
>>     |LOCK TABLES access_log, cookies, headers_in, 
>> headers_out, notes;|
>>     |-- Otherwise if you are using BDB or InnoDB tables use the
>>     following line to create your archive tables (and clear out the
>>     originals):|
>>     |BEGIN;|
>>     ||
>>     |CREATE TABLE access_log_archive SELECT * FROM access_log;|
>>     |||CREATE TABLE cookies_archive SELECT * FROM cookies;|
>>     |||CREATE TABLE headers_in_archive SELECT * FROM headers_in;|
>>     |||CREATE TABLE headers_out_archive SELECT * FROM headers_out;|
>>     |||CREATE TABLE notes_archive SELECT * FROM notes;|
>>     |||-- If you are using BDB or InnoDB tables use the 
>> following line:|
>>     ||COMMIT;
>>     |-- ||Otherwise ||if you are using MyISAM tables, or 
>> HEAP tables use
>>     the following line||||:|
>>     |UNLOCK TABLES ||access_log, cookies, headers_in, 
>> headers_out, notes;|
>>     ||
>> 
>> |To keep these tables populated, you should have your server 
>> run these 
>> scripts on a nightly basis: 
>> (http://dev.mysql.com/doc/mysql/en/insert-select.html)
>> |
>> 
>>     |LOCK TABLES access_log, cookies, headers_in, 
>> headers_out, notes;||
>>     -- If you are using MyISAM tables, or HEAP tables only|
>>     |BEGIN; -- for BDB or InnoDB tables only...|
>>     |INSERT INTO access_log_archive SELECT * FROM access_log;|
>>     |||INSERT INTO cookies_archive SELECT * FROM cookies;|
>>     |||INSERT INTO headers_in_archive SELECT * FROM headers_in;|
>>     |||INSERT INTO headers_out_archive SELECT * FROM headers_out;|
>>     |||INSERT INTO notes_archive SELECT * FROM notes;|
>>     ||DELETE FROM access_log; -- This is why transactions, 
>> or just table
>>     locking is important!  We don't want to delete the hits 
>> which were
>>     logged between copying them off into our archive tables and now!
>>     DELETE FROM headers_in;
>>     DELETE FROM headers_out;
>>     DELETE FROM cookies;
>>     DELETE FROM notes;
>> 
>>     ||||||COMMIT; -- for BDB or InnoDB tables only
>>     |UNLOCK TABLES ||access_log, cookies, headers_in, headers_out,
>>     notes;|| -- If you are using MyISAM tables, or HEAP tables only|
>>     ||||||
>>     ||
>> 
>> Indexes should be placed in the ARCHIVE tables only, on 
>> every column by 
>> which the results will be SORTED or in the WHERE clause.  The bare 
>> minimum indexes (which satisfy the most common queries one 
>> could have 
>> about their web-hits) which I use are:
>> 
>>     CREATE INDEX time_stamp ON access_log_archive (time_stamp);
>>     CREATE INDEX virtual_host ON access_log_archive (virtual_host);
>>     CREATE PRIMARY INDEX p_id ON access_log_archive (id); -- 
>> This index
>>     is only neccessary if you ever wish to manipulate an 
>> individual row
>>     (unlikely), or you are recording cookie and header data 
>> in the other
>>     tables, or if you are using mysql to create unique IDs 
>> (see below),
>>     otherwise it can be skipped
>>     -- *IF* you may want to do a lot of REFER[R]ER analysis, consider
>>     adding the following index:
>>         CREATE INDEX referrer ON access_log_archive (referer);
>>     -- IF you are curious WHICH files are being accessed the most per
>>     site, consider adding the following index:
>>         CREATE INDEX files_accessed ON access_log_archive 
>> (virtual_host,
>>     request_uri);
>>     -- IF you are curious to track WHICH IPs are visiting 
>> each site the
>>     most:
>>         CREATE INDEX biggest_fans ON access_log_archive 
>> (virtual_host,
>>     remote_host);
>> 
>> -- if you are using the other tables, you definately want to 
>> create an 
>> index on every column which references the ID column of access_log:
>> 
>>     CREATE INDEX id ON header_in_archive (id); -- Note: this index
>>     should NOT be primary!
>>     CREATE INDEX id ON header_out_archive (id); -- Note: this index
>>     should NOT be primary!
>>     CREATE INDEX id ON cookies_archive (id); -- Note: this 
>> index should
>>     NOT be primary!
>>     CREATE INDEX id ON notes_archive (id); -- Note: this index should
>>     NOT be primary!
>> 
>> These indexes allow for sub-millisecond response times on 
>> queries such as:
>> 
>>     SELECT * FROM access_logs_archive
>>     WHERE time_stamp BETWEEN |UNIX_TIMESTAMP('2005-05-01') AND
>>     UNIX_TIMESTAMP('2005-06-01')|
>>     |AND (virtual_host = "www.myfavoritedomain.com" OR
>>     "myfavoritedomain.com"); -- you could use virtual_host LIKE
>>     "%myfavoritedomain.com" instead, but this wouldn't take full
>>     advantage of the index on that column.|
>> 
>> 
>> Or my personal favorite query:
>> 
>>     SELECT YEAR(FROM_UNIXTIME(time_stamp)),
>>     MONTH(FROM_UNIXTIME(time_stamp)), DAY(FROM_UNIXTIME(time_stamp)),
>>     virtual_host,
>>          SUM(bytes_sent) AS bytes FROM access_log
>>     GROUP BY YEAR(FROM_UNIXTIME(time_stamp)),
>>     MONTH(FROM_UNIXTIME(time_stamp)), DAY(FROM_UNIXTIME(time_stamp)),
>>     virtual_host
>>     ORDER BY YEAR(FROM_UNIXTIME(time_stamp)),
>>     MONTH(FROM_UNIXTIME(time_stamp)),  
>> DAY(FROM_UNIXTIME(time_stamp)),
>>     virtual_host
>>     -- Alternatively: ORDER BY BYTES desc
>> 
>> I personally do not record anything to the cookies, headers, 
>> or notes 
>> tables, however I wanted the ID column to have integer ID 
>> numbers that 
>> always auto-increment, so I can refer to a large range of 
>> log-entries at 
>> once rapidly.  I found that mod_log_sql works flawlessly 
>> after I changed 
>> the type of the ID column to BIGINT PRIMARY KEY AUTO_INCREMENT.  The 
>> other advantages of this over using apache's mod_unique_id 
>> are: faster 
>> performance (mysql can figure out the next ID number in a 
>> PRIMARY KEY 
>> column very quickly), and less space being used (BIGINT 
>> takes 8 bytes, 
>> versus a 19 byte CHAR field being used to record the ID number from 
>> apache in each of 5 tables.  Don't forget that it does not 
>> take long for 
>> an active server to exceed 4 billion records.).  *I would strongly 
>> recommend that future implementations of mod_log_sql use mysql's 
>> AUTO_INCREMENTING integers feature, and join the other 
>> tables to each 
>> record using MySQL's LAST_INSERT_ID() function, or if using 
>> the C API, a 
>> faster method described here: 
>> http://dev.mysql.com/doc/mysql/en/getting-unique-id.html*
>> 
>> For more clever ideas, I recommend the book "High 
>> Performance MySQL" by 
>> O'Reilly publishing.
>> <http://www.amazon.com/exec/obidos/tg/detail/-/0596003064/>
>> Disclaimer: I have not personally tested any of the above 
>> code (except 
>> my favorite query), it is intended as a guideline of some 
>> good methods 
>> of minimizing server load while allowing fast transactional 
>> logging and 
>> fast queries to be written.  I also wrote this email in a 
>> hurry because 
>> I'm travelling to thailand VERY soon...
>> 
>> I'll try to follow up as soon as I have internet access again...
>> 
>> Thanks,
>> Matt
>> 
>> Edward Rudd wrote:
>> 
>> >On Tue, 2005-05-31 at 05:25, Justin Finkelstein wrote:
>> >  
>> >
>> >>Hi guys
>> >>
>> >>I've just had a look at my database and I've notice that nothing
>> >>appears to be logged for POST requests; does this need to be
>> >>explicitly turned on in the module?
>>    
>>
>_______________________________________________
>Download the latest version at http://www.outoforder.cc/projects/apache/mod_log_sql/
>
>To unsubscribe send an e-mail to 
>mod_log_sql-unsubscribe at lists.outoforder.cc
>
>  
>

-- 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.outoforder.cc/pipermail/mod_log_sql/attachments/20050611/fe803b73/attachment.html 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: C:\Documents and Settings\Administrator.MATT\My	Documents\My Pictures\smile.gif
Type: image/gif
Size: 55076 bytes
Desc: not available
Url : http://lists.outoforder.cc/pipermail/mod_log_sql/attachments/20050611/fe803b73/attachment.gif 


More information about the Mod_log_sql mailing list