[Mod_log_sql] Table indexing
Matt Erbst
merbst at cox.net
Fri Jun 3 09:47:22 EST 2005
When it comes to indexing, the answer is of course, "it depends". From
the perspective of a database engineer what we have here is an OLTP
(On-Line Transaction Processing) system which we also wish to use as an
OLAP (On-Line Analytical Processing) system.
mod_log_sql is recording Transactions (apache hits) in an On-Line manner
(as they happen) to the database. On heavily loaded sites it is writing
upwards of 100 hits per second, 24H a day, every day. If the option is
enabled, mod_log_sql will be writing up to 20 cookies PER HIT to the
cookies table, and multiple headers per hit to each of the header_in and
header_out tables. In other words the ratio of INSERTs to SELECTs will
be HUGE, so any indexes will overall cost more time than they save. If
you wish to use mod_log_sql on a high-traffic server AND run queries on
the data, it would be best to keep a mirror of these tables, with a
script that runs once every day to move the contents over from the first
table to the second set.
The set of tables being written to by mod_log_sql keeps the original
names, but the copy can be named "access_log_archive" and should contain
indexes.
|-- If you are using MyISAM tables, or HEAP tables use the following
line|| to create your archive tables (and clear out the originals)||:|
|LOCK TABLES access_log, cookies, headers_in, headers_out, notes;|
|-- Otherwise if you are using BDB or InnoDB tables use the
following line to create your archive tables (and clear out the
originals):|
|BEGIN;|
||
|CREATE TABLE access_log_archive SELECT * FROM access_log;|
|||CREATE TABLE cookies_archive SELECT * FROM cookies;|
|||CREATE TABLE headers_in_archive SELECT * FROM headers_in;|
|||CREATE TABLE headers_out_archive SELECT * FROM headers_out;|
|||CREATE TABLE notes_archive SELECT * FROM notes;|
|||-- If you are using BDB or InnoDB tables use the following line:|
||COMMIT;
|-- ||Otherwise ||if you are using MyISAM tables, or HEAP tables use
the following line||||:|
|UNLOCK TABLES ||access_log, cookies, headers_in, headers_out, notes;|
||
|To keep these tables populated, you should have your server run these
scripts on a nightly basis:
(http://dev.mysql.com/doc/mysql/en/insert-select.html)
|
|LOCK TABLES access_log, cookies, headers_in, headers_out, notes;||
-- If you are using MyISAM tables, or HEAP tables only|
|BEGIN; -- for BDB or InnoDB tables only...|
|INSERT INTO access_log_archive SELECT * FROM access_log;|
|||INSERT INTO cookies_archive SELECT * FROM cookies;|
|||INSERT INTO headers_in_archive SELECT * FROM headers_in;|
|||INSERT INTO headers_out_archive SELECT * FROM headers_out;|
|||INSERT INTO notes_archive SELECT * FROM notes;|
||DELETE FROM access_log; -- This is why transactions, or just table
locking is important! We don't want to delete the hits which were
logged between copying them off into our archive tables and now!
DELETE FROM headers_in;
DELETE FROM headers_out;
DELETE FROM cookies;
DELETE FROM notes;
||||||COMMIT; -- for BDB or InnoDB tables only
|UNLOCK TABLES ||access_log, cookies, headers_in, headers_out,
notes;|| -- If you are using MyISAM tables, or HEAP tables only|
||||||
||
Indexes should be placed in the ARCHIVE tables only, on every column by
which the results will be SORTED or in the WHERE clause. The bare
minimum indexes (which satisfy the most common queries one could have
about their web-hits) which I use are:
CREATE INDEX time_stamp ON access_log_archive (time_stamp);
CREATE INDEX virtual_host ON access_log_archive (virtual_host);
CREATE PRIMARY INDEX p_id ON access_log_archive (id); -- This index
is only neccessary if you ever wish to manipulate an individual row
(unlikely), or you are recording cookie and header data in the other
tables, or if you are using mysql to create unique IDs (see below),
otherwise it can be skipped
-- *IF* you may want to do a lot of REFER[R]ER analysis, consider
adding the following index:
CREATE INDEX referrer ON access_log_archive (referer);
-- IF you are curious WHICH files are being accessed the most per
site, consider adding the following index:
CREATE INDEX files_accessed ON access_log_archive (virtual_host,
request_uri);
-- IF you are curious to track WHICH IPs are visiting each site the
most:
CREATE INDEX biggest_fans ON access_log_archive (virtual_host,
remote_host);
-- if you are using the other tables, you definately want to create an
index on every column which references the ID column of access_log:
CREATE INDEX id ON header_in_archive (id); -- Note: this index
should NOT be primary!
CREATE INDEX id ON header_out_archive (id); -- Note: this index
should NOT be primary!
CREATE INDEX id ON cookies_archive (id); -- Note: this index should
NOT be primary!
CREATE INDEX id ON notes_archive (id); -- Note: this index should
NOT be primary!
These indexes allow for sub-millisecond response times on queries such as:
SELECT * FROM access_logs_archive
WHERE time_stamp BETWEEN |UNIX_TIMESTAMP('2005-05-01') AND
UNIX_TIMESTAMP('2005-06-01')|
|AND (virtual_host = "www.myfavoritedomain.com" OR
"myfavoritedomain.com"); -- you could use virtual_host LIKE
"%myfavoritedomain.com" instead, but this wouldn't take full
advantage of the index on that column.|
Or my personal favorite query:
SELECT YEAR(FROM_UNIXTIME(time_stamp)),
MONTH(FROM_UNIXTIME(time_stamp)), DAY(FROM_UNIXTIME(time_stamp)),
virtual_host,
SUM(bytes_sent) AS bytes FROM access_log
GROUP BY YEAR(FROM_UNIXTIME(time_stamp)),
MONTH(FROM_UNIXTIME(time_stamp)), DAY(FROM_UNIXTIME(time_stamp)),
virtual_host
ORDER BY YEAR(FROM_UNIXTIME(time_stamp)),
MONTH(FROM_UNIXTIME(time_stamp)), DAY(FROM_UNIXTIME(time_stamp)),
virtual_host
-- Alternatively: ORDER BY BYTES desc
I personally do not record anything to the cookies, headers, or notes
tables, however I wanted the ID column to have integer ID numbers that
always auto-increment, so I can refer to a large range of log-entries at
once rapidly. I found that mod_log_sql works flawlessly after I changed
the type of the ID column to BIGINT PRIMARY KEY AUTO_INCREMENT. The
other advantages of this over using apache's mod_unique_id are: faster
performance (mysql can figure out the next ID number in a PRIMARY KEY
column very quickly), and less space being used (BIGINT takes 8 bytes,
versus a 19 byte CHAR field being used to record the ID number from
apache in each of 5 tables. Don't forget that it does not take long for
an active server to exceed 4 billion records.). *I would strongly
recommend that future implementations of mod_log_sql use mysql's
AUTO_INCREMENTING integers feature, and join the other tables to each
record using MySQL's LAST_INSERT_ID() function, or if using the C API, a
faster method described here:
http://dev.mysql.com/doc/mysql/en/getting-unique-id.html*
For more clever ideas, I recommend the book "High Performance MySQL" by
O'Reilly publishing.
<http://www.amazon.com/exec/obidos/tg/detail/-/0596003064/>
Disclaimer: I have not personally tested any of the above code (except
my favorite query), it is intended as a guideline of some good methods
of minimizing server load while allowing fast transactional logging and
fast queries to be written. I also wrote this email in a hurry because
I'm travelling to thailand VERY soon...
I'll try to follow up as soon as I have internet access again...
Thanks,
Matt
Edward Rudd wrote:
>On Tue, 2005-05-31 at 05:25, Justin Finkelstein wrote:
>
>
>>Hi guys
>>
>>I've just had a look at my database and I've notice that nothing
>>appears to be logged for POST requests; does this need to be
>>explicitly turned on in the module?
>>
>>
>
>They should be logged and are so on my servers running 1.100. The
>actual data posted does not and will not be logged. but the requested
>URL will be.
>
>
>
>>Also, I use the log_mass_virtual_hosting option, and I've noticed that
>>this doesn't generate any indices on the tables - could I request an
>>option for creating indices on the fly for new tables? I'd be happy to
>>work on this if the project's open for other devs...
>>
>>
>
>The biggest issue with adding indexes (which I believe there is a bug
>report on issues.outoforder.cc about this) is which columns to index,
>and worrying about the performance hit for logging, as having indexes
>slows down the process of inserting data. I plan to add back in INSERT
>DELAYED for mysql in the next release which should relieve this issue
>some, but that will only help MySQL back-ends.
>
>If you are handy and knowledgeable w/ SQL index configuration and would
>like to test indexes and find a *good* fast set for the table that would
>help.
>
>And I'm always open to contributions, just send the patches to this list
>or attached to a bug in issues.outoforder.cc.
>
>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>Download the latest version at http://www.outoforder.cc/projects/apache/mod_log_sql/
>
>To unsubscribe send an e-mail to
>mod_log_sql-unsubscribe at lists.outoforder.cc
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.outoforder.cc/pipermail/mod_log_sql/attachments/20050603/378782a4/attachment.html
More information about the Mod_log_sql
mailing list