- WHY?
- HOW?
- Log analyzer
- Database
- Analytics UI
- License
- Download
WHY?
Our Squid proxies generate about 700 million log entries monthly.
Analyzing that amount of data is quite time consuming: a simple
grep takes about 40 minutes.
So, what do we do?
We parse the logs and feed them into a database. 'Feeding' here means
aggregation, partitioning and indexing; the entire process takes less
than 3 hours.
Once in the database, we can drill down by time, HTTP response,
cache code, mime type... within seconds.
While that kind of statistics can be produced by various reporting
tools, those tools don't let us see the actual log entries.
But we can drill through - that is, retrieve the actual log entries
that satisfy given criteria - within seconds, or up to a few minutes,
depending on the range selected.
And maybe even more important, data mining gives us valuable insight
into the data: while drilling, some things become obvious, things
that admins rarely look for.
Things like, for example, the bandwidth usage of 'are you alive' probes.
See, these probes affect your overall cache hit rate.
HOW?
The system consists of three components:
Log analyzer
This is the actual SquidCube code, which performs:
- log parsing, including
  - sanity checks
  - URL verification and server name extraction
- aggregation: calculating sums of bytes and durations and counting
  individual log entries by
  - server
  - time
  - client IP
  - HTTP method
  - destination host
  - cache code and subcode
  - result code and subcode
  - mime type and subtype
  ... and combinations of the above, to improve lookup performance
  (a rough SQL sketch of this aggregation follows the list)
- partitioning: for performance reasons, data is sliced into smaller
  pieces by
  - date
  - HTTP method
- CSV generation: saves aggregates and partitions as CSV files in the
  system temporary directory
- SQL generation: creates SQL statements (PostgreSQL 9.2 syntax) that
  create the partitions and aggregates, also saved in the system
  temporary directory
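As a rough illustration of what the aggregation step computes, here is
a sketch in SQL; the table and column names (raw_log, host, bytes,
duration) are hypothetical, not the actual SquidCube schema:

select host,
       count(*)      as count,
       sum(bytes)    as bytes,
       sum(duration) as duration
  from raw_log
 group by host;

The same pattern repeats for each dimension (time, client IP, method,
and so on) and for their combinations.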
It can feed the data on-line (the equivalent of tail -f) or off-line,
into a single table or partitioned, can generate only aggregates or
only fact tables, etc.
Run bin/squidcube.sh for the complete list of arguments.
Note: most of these use cases are not maintained and their
implementation may be broken. Only the features we use are maintained:
we use it off-line only, as we feed the data on demand, and we always
generate both facts and aggregates, always partitioned.
Database
Our database of choice is PostgreSQL 9.2, for its
- fast bulk loading using the COPY command
- ability to load CSV files
- ability to turn off transactional logging for some tables
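A minimal sketch of those features together, with hypothetical table,
column and file names:

create unlogged table agg_host_load (
    host  text,
    count bigint,
    bytes bigint
);
copy agg_host_load from '/tmp/agg_host.csv' with (format csv);

Unlogged tables skip the write-ahead log, which speeds up bulk loading
at the cost of crash safety.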
PG 9.2 users:
- create the squidlog database
- run psql squidlog < sql/squidlog.sql to create tables and views
- run bin/loadcsv.sh to load the generated CSV files into the database
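The first step can be done from psql, for example:

create database squidlog;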
Users of earlier PG versions may need to modify the syntax of the
generated SQL files, including removing some or all of:
- unlogged
- oids=false
- autovacuum_enabled=false
- inherits
- check
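For orientation, the generated DDL uses PostgreSQL 9.2-style
partitioning by inheritance; a simplified, hypothetical example (the
actual table and column names will differ) looks roughly like this:

create unlogged table facts_20130101_get (
    check (log_date = date '2013-01-01' and method = 'GET')
) inherits (facts)
with (oids = false, autovacuum_enabled = false);

On earlier versions, dropping the unsupported clauses from such
statements should be enough.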
Users of other databases:
- modify the SQL
- see how you can load CSV files
- add the JDBC driver to the lib subdir
- change the connect() method in SquidLog.java
...and let us know!
After loading, don't forget to create indexes for your new
partitions and aggregates: run bin/indexes.sh.
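The statements run by bin/indexes.sh are ordinary CREATE INDEX
commands; for example (the index name here is made up, and the actual
tables and columns depend on sql/squidlog.sql):

create index agg_host_count_idx on agg_host (count);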
Once that's done, it's easy to get some important information, for
example:
Request peak?
select * from agg_time order by count desc limit 1;
Traffic peak?
select * from agg_time order by bytes desc limit 1;
Top 10 hosts?
select * from agg_host order by count desc limit 10;
... and so on.
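Drill-through works the same way, only against the fact partitions
instead of the aggregates; the table name below is hypothetical, so
check sql/squidlog.sql for the actual schema:

Largest requests on a given day?
select *
  from facts_20130101_get
 order by bytes desc
 limit 100;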
Analytics UI
Start the Pentaho BI server by running biserver-ce/start-pentaho.sh.
Once started, go to localhost:8080 and log in as 'joe' with password
'password' (the Pentaho defaults). Open SquidLog/squidlog.xaction and
click as much as you like.
For all other information on Pentaho, start at http://community.pentaho.com/.
Here's a screenshot: