[Mod_log_sql] Table indexing

Matt Erbst merbst at cox.net
Fri Jun 3 09:47:22 EST 2005


When it comes to indexing, the answer is of course, "it depends".  From 
the perspective of a database engineer what we have here is an OLTP 
(On-Line Transaction Processing) system which we also wish to use as an 
OLAP (On-Line Analytical Processing) system.

mod_log_sql is recording Transactions (apache hits) in an On-Line manner 
(as they happen) to the database.  On heavily loaded sites it is writing 
upwards of 100 hits per second, 24H a day, every day.  If the option is 
enabled, mod_log_sql will be writing up to 20 cookies PER HIT to the 
cookies table, and multiple headers per hit to each of the header_in and 
header_out tables.  In other words the ratio of INSERTs to SELECTs will 
be HUGE, so any indexes will overall cost more time than they save.  If 
you wish to use mod_log_sql on a high-traffic server AND run queries on 
the data, it would be best to keep a mirror of these tables, with a 
script that runs once every day to move the contents over from the first 
table to the second set.

The set of tables being written to by mod_log_sql keeps the original 
names, but the copy can be named "access_log_archive" and should contain 
indexes.

    |-- If you are using MyISAM tables, or HEAP tables use the following
    line|| to create your archive tables (and clear out the originals)||:|
    |LOCK TABLES access_log, cookies, headers_in, headers_out, notes;|
    |-- Otherwise if you are using BDB or InnoDB tables use the
    following line to create your archive tables (and clear out the
    originals):|
    |BEGIN;|
    ||
    |CREATE TABLE access_log_archive SELECT * FROM access_log;|
    |||CREATE TABLE cookies_archive SELECT * FROM cookies;|
    |||CREATE TABLE headers_in_archive SELECT * FROM headers_in;|
    |||CREATE TABLE headers_out_archive SELECT * FROM headers_out;|
    |||CREATE TABLE notes_archive SELECT * FROM notes;|
    |||-- If you are using BDB or InnoDB tables use the following line:|
    ||COMMIT;
    |-- ||Otherwise ||if you are using MyISAM tables, or HEAP tables use
    the following line||||:|
    |UNLOCK TABLES ||access_log, cookies, headers_in, headers_out, notes;|
    ||

|To keep these tables populated, you should have your server run these 
scripts on a nightly basis: 
(http://dev.mysql.com/doc/mysql/en/insert-select.html)
|

    |LOCK TABLES access_log, cookies, headers_in, headers_out, notes;||
    -- If you are using MyISAM tables, or HEAP tables only|
    |BEGIN; -- for BDB or InnoDB tables only...|
    |INSERT INTO access_log_archive SELECT * FROM access_log;|
    |||INSERT INTO cookies_archive SELECT * FROM cookies;|
    |||INSERT INTO headers_in_archive SELECT * FROM headers_in;|
    |||INSERT INTO headers_out_archive SELECT * FROM headers_out;|
    |||INSERT INTO notes_archive SELECT * FROM notes;|
    ||DELETE FROM access_log; -- This is why transactions, or just table
    locking is important!  We don't want to delete the hits which were
    logged between copying them off into our archive tables and now!
    DELETE FROM headers_in;
    DELETE FROM headers_out;
    DELETE FROM cookies;
    DELETE FROM notes;

    ||||||COMMIT; -- for BDB or InnoDB tables only
    |UNLOCK TABLES ||access_log, cookies, headers_in, headers_out,
    notes;|| -- If you are using MyISAM tables, or HEAP tables only|
    ||||||
    ||

Indexes should be placed in the ARCHIVE tables only, on every column by 
which the results will be SORTED or in the WHERE clause.  The bare 
minimum indexes (which satisfy the most common queries one could have 
about their web-hits) which I use are:

    CREATE INDEX time_stamp ON access_log_archive (time_stamp);
    CREATE INDEX virtual_host ON access_log_archive (virtual_host);
    CREATE PRIMARY INDEX p_id ON access_log_archive (id); -- This index
    is only neccessary if you ever wish to manipulate an individual row
    (unlikely), or you are recording cookie and header data in the other
    tables, or if you are using mysql to create unique IDs (see below),
    otherwise it can be skipped
    -- *IF* you may want to do a lot of REFER[R]ER analysis, consider
    adding the following index:
        CREATE INDEX referrer ON access_log_archive (referer);
    -- IF you are curious WHICH files are being accessed the most per
    site, consider adding the following index:
        CREATE INDEX files_accessed ON access_log_archive (virtual_host,
    request_uri);
    -- IF you are curious to track WHICH IPs are visiting each site the
    most:
        CREATE INDEX biggest_fans ON access_log_archive (virtual_host,
    remote_host);

-- if you are using the other tables, you definately want to create an 
index on every column which references the ID column of access_log:

    CREATE INDEX id ON header_in_archive (id); -- Note: this index
    should NOT be primary!
    CREATE INDEX id ON header_out_archive (id); -- Note: this index
    should NOT be primary!
    CREATE INDEX id ON cookies_archive (id); -- Note: this index should
    NOT be primary!
    CREATE INDEX id ON notes_archive (id); -- Note: this index should
    NOT be primary!

These indexes allow for sub-millisecond response times on queries such as:

    SELECT * FROM access_logs_archive
    WHERE time_stamp BETWEEN |UNIX_TIMESTAMP('2005-05-01') AND
    UNIX_TIMESTAMP('2005-06-01')|
    |AND (virtual_host = "www.myfavoritedomain.com" OR
    "myfavoritedomain.com"); -- you could use virtual_host LIKE
    "%myfavoritedomain.com" instead, but this wouldn't take full
    advantage of the index on that column.|


Or my personal favorite query:

    SELECT YEAR(FROM_UNIXTIME(time_stamp)),
    MONTH(FROM_UNIXTIME(time_stamp)), DAY(FROM_UNIXTIME(time_stamp)),
    virtual_host,
         SUM(bytes_sent) AS bytes FROM access_log
    GROUP BY YEAR(FROM_UNIXTIME(time_stamp)),
    MONTH(FROM_UNIXTIME(time_stamp)), DAY(FROM_UNIXTIME(time_stamp)),
    virtual_host
    ORDER BY YEAR(FROM_UNIXTIME(time_stamp)),
    MONTH(FROM_UNIXTIME(time_stamp)),  DAY(FROM_UNIXTIME(time_stamp)),
    virtual_host
    -- Alternatively: ORDER BY BYTES desc

I personally do not record anything to the cookies, headers, or notes 
tables, however I wanted the ID column to have integer ID numbers that 
always auto-increment, so I can refer to a large range of log-entries at 
once rapidly.  I found that mod_log_sql works flawlessly after I changed 
the type of the ID column to BIGINT PRIMARY KEY AUTO_INCREMENT.  The 
other advantages of this over using apache's mod_unique_id are: faster 
performance (mysql can figure out the next ID number in a PRIMARY KEY 
column very quickly), and less space being used (BIGINT takes 8 bytes, 
versus a 19 byte CHAR field being used to record the ID number from 
apache in each of 5 tables.  Don't forget that it does not take long for 
an active server to exceed 4 billion records.).  *I would strongly 
recommend that future implementations of mod_log_sql use mysql's 
AUTO_INCREMENTING integers feature, and join the other tables to each 
record using MySQL's LAST_INSERT_ID() function, or if using the C API, a 
faster method described here: 
http://dev.mysql.com/doc/mysql/en/getting-unique-id.html*

For more clever ideas, I recommend the book "High Performance MySQL" by 
O'Reilly publishing.
<http://www.amazon.com/exec/obidos/tg/detail/-/0596003064/>
Disclaimer: I have not personally tested any of the above code (except 
my favorite query), it is intended as a guideline of some good methods 
of minimizing server load while allowing fast transactional logging and 
fast queries to be written.  I also wrote this email in a hurry because 
I'm travelling to thailand VERY soon...

I'll try to follow up as soon as I have internet access again...

Thanks,
Matt

Edward Rudd wrote:

>On Tue, 2005-05-31 at 05:25, Justin Finkelstein wrote:
>  
>
>>Hi guys
>>
>>I've just had a look at my database and I've notice that nothing
>>appears to be logged for POST requests; does this need to be
>>explicitly turned on in the module?
>>    
>>
>
>They should be logged and are so on my servers running 1.100.  The
>actual data posted does not and will not be logged. but the requested
>URL will be.
>
>  
>
>>Also, I use the log_mass_virtual_hosting option, and I've noticed that
>>this doesn't generate any indices on the tables - could I request an
>>option for creating indices on the fly for new tables? I'd be happy to
>>work on this if the project's open for other devs...
>>    
>>
>
>The biggest issue with adding indexes (which I believe there is a bug
>report on issues.outoforder.cc about this) is which columns to index,
>and worrying about the performance hit for logging, as having indexes
>slows down the process of inserting data.  I plan to add back in INSERT
>DELAYED for mysql in the next release which should relieve this issue
>some, but that will only help MySQL back-ends.
>
>If you are handy and knowledgeable w/ SQL index configuration and would
>like to test indexes and find a *good* fast set for the table that would
>help.
>
>And I'm always open to contributions, just send the patches to this list
>or attached to a bug in issues.outoforder.cc.
>
>  
>
>------------------------------------------------------------------------
>
>_______________________________________________
>Download the latest version at http://www.outoforder.cc/projects/apache/mod_log_sql/
>
>To unsubscribe send an e-mail to 
>mod_log_sql-unsubscribe at lists.outoforder.cc
>  
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.outoforder.cc/pipermail/mod_log_sql/attachments/20050603/378782a4/attachment.html 


More information about the Mod_log_sql mailing list