<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Well the reason for the explicit locks is to make sure no records are
deleted from the original table which have not been inserted into the
archive. (on a high traffic server, 100s of records could be written in
the time it takes to insert all those records into a new table)<br>
<br>
INSERT IGNORE would succeed at preventing duplicates, although only by
checking the primary keys of each record prior to being inserted (which
may or may not cause a significant performance problem, depending upon
numerous factors). However, one would still need to execute a DELETE
statement on the original table(s) at some point to keep it to only a
month's worth of data. Unless this delete is executed in the same
transaction batch as the command populating the archive tables, there
is a possibility that (many) newly created records will be irreversibly
deleted. This is the precise scenario for which transactional
databases were developed.<br>
<br>
You have implicitly brought up a good point that it may make more sense
to have your archive tables in a seperate database entirely. My
example was for the scenario where the user only has rights to 1
database on the server, although I now realise this is extremely
unlikely among those who have installed mod_log_sql.<br>
<br>
I would like to draw attention to the idea of offering support
(possibly as a compile-time option) for using the database's mechanism
for auto-incrementing big integers to provide a unique identifier for
each hit rather than apache's mod_unique_id module. The main
challenges as I see it are:<br>
<ul>
<li>providing an alternate schema to support this...</li>
<ul>
<li>The BIGINT type is part of the SQL standard, and is thus
supported by all reasonable DBMSes on all resonable platforms.
However, there are differences in the implementation. For example, the
optimal syntax in a create table statement for MySQL would be:</li>
</ul>
<ul>
<ul>
<li>UNSIGNED BIGINT AUTO_INCREMENT PRIMARY KEY,</li>
</ul>
</ul>
<ul>
<li>While the syntax to create a table in PGSQL is: (unfortunately
PGSQL doesn't seem to support unsigned big integers [I am not a PGSQL
expert])</li>
<ul>
<li>BIGSERIAL PRIMARY KEY,<br>
</li>
</ul>
</ul>
<ul>
<li>References:<br>
</li>
</ul>
<ul>
<ul>
<li><a class="moz-txt-link-freetext" href="http://www.postgresql.org/docs/8.0/interactive/datatype.html#DATATYPE-NUMERIC">http://www.postgresql.org/docs/8.0/interactive/datatype.html#DATATYPE-NUMERIC</a></li>
<li><a class="moz-txt-link-freetext" href="http://www.postgresql.org/docs/7.4/interactive/datatype.html#DATATYPE-NUMERIC">http://www.postgresql.org/docs/7.4/interactive/datatype.html#DATATYPE-NUMERIC</a></li>
<li><a class="moz-txt-link-freetext" href="http://dev.mysql.com/doc/mysql/en/numeric-types.html">http://dev.mysql.com/doc/mysql/en/numeric-types.html</a></li>
</ul>
</ul>
<li>modifying the INSERT INTO cookies, INSERT INTO headers_out,
INSERT INTO headers_in, INSERT INTO notes statements to use the
database's mechanism for each database. This could be really
challenging except for the convenient fact that Perl's DBI library
provides the following function:<br>
</li>
<ul>
<li><a class="moz-txt-link-freetext" href="http://search.cpan.org/~timb/DBI/DBI.pm#last_insert_id">http://search.cpan.org/~timb/DBI/DBI.pm#last_insert_id</a></li>
</ul>
</ul>
I would love to try to submit a patch for this if I had more experience
with C, or felt more confident with doing this cleanly. No promises
though, and since my needs (a unique ID without installing
mod_unique_id) are satisfied by the simple workaround described in the
previous email, the changes should be done by somebody with the
experience to do so cleanly, or not at all.<br>
<br>
Maybe I'll package some scripts for creating and populating
well-indexed archive tables, and some sample queries for use with
<a class="moz-txt-link-abbreviated" href="http://www.phpmyadmin.org">www.phpmyadmin.org</a> (or <a class="moz-txt-link-freetext" href="http://phppgadmin.org">http://phppgadmin.org</a> ) but if somebody else
does it first, thank you!<br>
<br>
Thanks,<br>
Matt<br>
<br>
Carl Edwards wrote:
<blockquote
cite="mid6E9FB6B656E007438A624FD5D7609C1D01EEABAC@MX-CAM.vsc.vitesse.com"
type="cite">
<pre wrap="">Hi Matt,
Thanks for the info on indexes and archiving. Instead of using
explicit locks and emptying the whole table how about keeping a
months worth of data in the current table and using something like:
INSERT IGNORE INTO archive.access SELECT * from current.access;
To keep the archive up to date?
Thanks,
<a class="moz-txt-link-abbreviated" href="mailto:Carl@sdrawdE.net">Carl@sdrawdE.net</a>
</pre>
<blockquote type="cite">
<pre wrap=""> -----Original Message-----
Today's Topics:
1. Re: Table indexing (Matt Erbst)
----------------------------------------------------------------------
Date: Fri, 03 Jun 2005 07:47:22 -0700
From: Matt Erbst <a class="moz-txt-link-rfc2396E" href="mailto:merbst@cox.net"><merbst@cox.net></a>
Subject: Re: [Mod_log_sql] Table indexing
When it comes to indexing, the answer is of course, "it
depends". From
the perspective of a database engineer what we have here is an OLTP
(On-Line Transaction Processing) system which we also wish
to use as an
OLAP (On-Line Analytical Processing) system.
mod_log_sql is recording Transactions (apache hits) in an
On-Line manner
(as they happen) to the database. On heavily loaded sites
it is writing
upwards of 100 hits per second, 24H a day, every day. If
the option is
enabled, mod_log_sql will be writing up to 20 cookies PER HIT to the
cookies table, and multiple headers per hit to each of the
header_in and
header_out tables. In other words the ratio of INSERTs to
SELECTs will
be HUGE, so any indexes will overall cost more time than
they save. If
you wish to use mod_log_sql on a high-traffic server AND run
queries on
the data, it would be best to keep a mirror of these tables, with a
script that runs once every day to move the contents over
from the first
table to the second set.
The set of tables being written to by mod_log_sql keeps the original
names, but the copy can be named "access_log_archive" and
should contain
indexes.
|-- If you are using MyISAM tables, or HEAP tables use
the following
line|| to create your archive tables (and clear out the
originals)||:|
|LOCK TABLES access_log, cookies, headers_in,
headers_out, notes;|
|-- Otherwise if you are using BDB or InnoDB tables use the
following line to create your archive tables (and clear out the
originals):|
|BEGIN;|
||
|CREATE TABLE access_log_archive SELECT * FROM access_log;|
|||CREATE TABLE cookies_archive SELECT * FROM cookies;|
|||CREATE TABLE headers_in_archive SELECT * FROM headers_in;|
|||CREATE TABLE headers_out_archive SELECT * FROM headers_out;|
|||CREATE TABLE notes_archive SELECT * FROM notes;|
|||-- If you are using BDB or InnoDB tables use the
following line:|
||COMMIT;
|-- ||Otherwise ||if you are using MyISAM tables, or
HEAP tables use
the following line||||:|
|UNLOCK TABLES ||access_log, cookies, headers_in,
headers_out, notes;|
||
|To keep these tables populated, you should have your server
run these
scripts on a nightly basis:
(<a class="moz-txt-link-freetext" href="http://dev.mysql.com/doc/mysql/en/insert-select.html">http://dev.mysql.com/doc/mysql/en/insert-select.html</a>)
|
|LOCK TABLES access_log, cookies, headers_in,
headers_out, notes;||
-- If you are using MyISAM tables, or HEAP tables only|
|BEGIN; -- for BDB or InnoDB tables only...|
|INSERT INTO access_log_archive SELECT * FROM access_log;|
|||INSERT INTO cookies_archive SELECT * FROM cookies;|
|||INSERT INTO headers_in_archive SELECT * FROM headers_in;|
|||INSERT INTO headers_out_archive SELECT * FROM headers_out;|
|||INSERT INTO notes_archive SELECT * FROM notes;|
||DELETE FROM access_log; -- This is why transactions,
or just table
locking is important! We don't want to delete the hits
which were
logged between copying them off into our archive tables and now!
DELETE FROM headers_in;
DELETE FROM headers_out;
DELETE FROM cookies;
DELETE FROM notes;
||||||COMMIT; -- for BDB or InnoDB tables only
|UNLOCK TABLES ||access_log, cookies, headers_in, headers_out,
notes;|| -- If you are using MyISAM tables, or HEAP tables only|
||||||
||
Indexes should be placed in the ARCHIVE tables only, on
every column by
which the results will be SORTED or in the WHERE clause. The bare
minimum indexes (which satisfy the most common queries one
could have
about their web-hits) which I use are:
CREATE INDEX time_stamp ON access_log_archive (time_stamp);
CREATE INDEX virtual_host ON access_log_archive (virtual_host);
CREATE PRIMARY INDEX p_id ON access_log_archive (id); --
This index
is only neccessary if you ever wish to manipulate an
individual row
(unlikely), or you are recording cookie and header data
in the other
tables, or if you are using mysql to create unique IDs
(see below),
otherwise it can be skipped
-- *IF* you may want to do a lot of REFER[R]ER analysis, consider
adding the following index:
CREATE INDEX referrer ON access_log_archive (referer);
-- IF you are curious WHICH files are being accessed the most per
site, consider adding the following index:
CREATE INDEX files_accessed ON access_log_archive
(virtual_host,
request_uri);
-- IF you are curious to track WHICH IPs are visiting
each site the
most:
CREATE INDEX biggest_fans ON access_log_archive
(virtual_host,
remote_host);
-- if you are using the other tables, you definately want to
create an
index on every column which references the ID column of access_log:
CREATE INDEX id ON header_in_archive (id); -- Note: this index
should NOT be primary!
CREATE INDEX id ON header_out_archive (id); -- Note: this index
should NOT be primary!
CREATE INDEX id ON cookies_archive (id); -- Note: this
index should
NOT be primary!
CREATE INDEX id ON notes_archive (id); -- Note: this index should
NOT be primary!
These indexes allow for sub-millisecond response times on
queries such as:
SELECT * FROM access_logs_archive
WHERE time_stamp BETWEEN |UNIX_TIMESTAMP('2005-05-01') AND
UNIX_TIMESTAMP('2005-06-01')|
|AND (virtual_host = "<a class="moz-txt-link-abbreviated" href="http://www.myfavoritedomain.com">www.myfavoritedomain.com</a>" OR
"myfavoritedomain.com"); -- you could use virtual_host LIKE
"%myfavoritedomain.com" instead, but this wouldn't take full
advantage of the index on that column.|
Or my personal favorite query:
SELECT YEAR(FROM_UNIXTIME(time_stamp)),
MONTH(FROM_UNIXTIME(time_stamp)), DAY(FROM_UNIXTIME(time_stamp)),
virtual_host,
SUM(bytes_sent) AS bytes FROM access_log
GROUP BY YEAR(FROM_UNIXTIME(time_stamp)),
MONTH(FROM_UNIXTIME(time_stamp)), DAY(FROM_UNIXTIME(time_stamp)),
virtual_host
ORDER BY YEAR(FROM_UNIXTIME(time_stamp)),
MONTH(FROM_UNIXTIME(time_stamp)),
DAY(FROM_UNIXTIME(time_stamp)),
virtual_host
-- Alternatively: ORDER BY BYTES desc
I personally do not record anything to the cookies, headers,
or notes
tables, however I wanted the ID column to have integer ID
numbers that
always auto-increment, so I can refer to a large range of
log-entries at
once rapidly. I found that mod_log_sql works flawlessly
after I changed
the type of the ID column to BIGINT PRIMARY KEY AUTO_INCREMENT. The
other advantages of this over using apache's mod_unique_id
are: faster
performance (mysql can figure out the next ID number in a
PRIMARY KEY
column very quickly), and less space being used (BIGINT
takes 8 bytes,
versus a 19 byte CHAR field being used to record the ID number from
apache in each of 5 tables. Don't forget that it does not
take long for
an active server to exceed 4 billion records.). *I would strongly
recommend that future implementations of mod_log_sql use mysql's
AUTO_INCREMENTING integers feature, and join the other
tables to each
record using MySQL's LAST_INSERT_ID() function, or if using
the C API, a
faster method described here:
<a class="moz-txt-link-freetext" href="http://dev.mysql.com/doc/mysql/en/getting-unique-id.html*">http://dev.mysql.com/doc/mysql/en/getting-unique-id.html*</a>
For more clever ideas, I recommend the book "High
Performance MySQL" by
O'Reilly publishing.
<a class="moz-txt-link-rfc2396E" href="http://www.amazon.com/exec/obidos/tg/detail/-/0596003064/"><http://www.amazon.com/exec/obidos/tg/detail/-/0596003064/></a>
Disclaimer: I have not personally tested any of the above
code (except
my favorite query), it is intended as a guideline of some
good methods
of minimizing server load while allowing fast transactional
logging and
fast queries to be written. I also wrote this email in a
hurry because
I'm travelling to thailand VERY soon...
I'll try to follow up as soon as I have internet access again...
Thanks,
Matt
Edward Rudd wrote:
>On Tue, 2005-05-31 at 05:25, Justin Finkelstein wrote:
>
>
>>Hi guys
>>
>>I've just had a look at my database and I've notice that nothing
>>appears to be logged for POST requests; does this need to be
>>explicitly turned on in the module?
</pre>
</blockquote>
<pre wrap=""><!---->_______________________________________________
Download the latest version at <a class="moz-txt-link-freetext" href="http://www.outoforder.cc/projects/apache/mod_log_sql/">http://www.outoforder.cc/projects/apache/mod_log_sql/</a>
To unsubscribe send an e-mail to
<a class="moz-txt-link-abbreviated" href="mailto:mod_log_sql-unsubscribe@lists.outoforder.cc">mod_log_sql-unsubscribe@lists.outoforder.cc</a>
</pre>
</blockquote>
<br>
<div class="moz-signature">-- <br>
<img src="cid:part1.03000703.04080600@cox.net" border="0"></div>
</body>
</html>