- WHY?
- HOW?
- Log analyzer
- Database
- Analytics UI
- License
- Download
WHY?
Our Squid proxies generate about 700 million log entries monthly.
Analyzing that amount of data is quite time consuming: a simple
grep takes about 40 minutes.
So, what do we do?
We parse the logs and feed them into a database. 'Feeding' here means
aggregation, partitioning and indexing; the entire process takes less
than 3 hours.
Once in the database, we can drill down by time, HTTP response,
cache code, mime type... within seconds.
While that kind of statistics can be produced by various reporting
tools, those tools don't let us see the actual log entries.
But we can drill through - that is, retrieve the actual log entries
that satisfy given criteria - within seconds, or up to a few minutes,
depending on the range selected.
And maybe even more important, data mining gives us valuable insight
into the data: while drilling, some things become obvious, things
that admins rarely look for.
Things like, for example, the bandwidth usage of 'are you alive' probes.
See, these probes affect your overall cache hit rate.
HOW?
The system consists of three components:
Log analyzer
This is the actual SquidCube code, which performs:
- log parsing, including
  - sanity checks
  - URL verification and server name extraction
- aggregation: calculating sums of bytes and durations and counting
  individual log entries by
  - server
  - time
  - client IP
  - HTTP method
  - destination host
  - cache code and subcode
  - result code and subcode
  - mime type and subtype
  ... and combinations of the above, to improve lookup performance
  (a rough SQL sketch of this aggregation follows the list)
- partitioning: for performance reasons, data is sliced into smaller
  pieces by
  - date
  - HTTP method
- CSV generation: saves aggregates and partitions as CSV files in the
  system temporary directory
- SQL generation: creates SQL statements (PostgreSQL 9.2 syntax) that
  create the partitions and aggregates, also saved in the system
  temporary directory
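As a rough illustration of what the aggregation step computes, here is
a sketch in SQL; the table and column names (raw_log, host, bytes,
duration) are hypothetical, not the actual SquidCube schema:

select host,
       count(*)      as count,
       sum(bytes)    as bytes,
       sum(duration) as duration
  from raw_log
 group by host;

The same pattern repeats for each dimension (time, client IP, method,
and so on) and for their combinations.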
It can feed the data on-line (the equivalent of tail -f) or off-line,
into a single table or partitioned, can generate only aggregates or
only fact tables, etc.
Run bin/squidcube.sh for the complete list of arguments.
Note: most of these use cases are not maintained and their
implementation may be broken. Only the features we use are maintained:
we use it off-line only, as we feed the data on demand, and we always
generate both facts and aggregates, always partitioned.
Database
Our database of choice is PostgreSQL 9.2, for its
- fast bulk loading using the COPY command
- ability to load CSV files
- ability to turn off transactional logging for some tables
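A minimal sketch of those features together, with hypothetical table,
column and file names:

create unlogged table agg_host_load (
    host  text,
    count bigint,
    bytes bigint
);
copy agg_host_load from '/tmp/agg_host.csv' with (format csv);

Unlogged tables skip the write-ahead log, which speeds up bulk loading
at the cost of crash safety.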
PG 9.2 users:
- create the squidlog database
- run psql squidlog < sql/squidlog.sql to create tables and views
- run bin/loadcsv.sh to load the generated CSV files into the database
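The first step can be done from psql, for example:

create database squidlog;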
Users of earlier PG versions may need to modify the syntax of the
generated SQL files, including removing some or all of:
- unlogged
- oids=false
- autovacuum_enabled=false
- inherits
- check
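For orientation, the generated DDL uses PostgreSQL 9.2-style
partitioning by inheritance; a simplified, hypothetical example (the
actual table and column names will differ) looks roughly like this:

create unlogged table facts_20130101_get (
    check (log_date = date '2013-01-01' and method = 'GET')
) inherits (facts)
with (oids = false, autovacuum_enabled = false);

On earlier versions, dropping the unsupported clauses from such
statements should be enough.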
Users of other databases:
- modify the SQL
- see how you can load CSV files
- add the JDBC driver to the lib subdir
- change the connect() method in SquidLog.java
...and let us know!
After loading, don't forget to create indexes for your new
partitions and aggregates: run bin/indexes.sh.
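The statements run by bin/indexes.sh are ordinary CREATE INDEX
commands; for example (the index name here is made up, and the actual
tables and columns depend on sql/squidlog.sql):

create index agg_host_count_idx on agg_host (count);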
Once that's done, it's easy to get some important information, for
example:
Request peak?
select * from agg_time order by count desc limit 1;
Traffic peak?
select * from agg_time order by bytes desc limit 1;
Top 10 hosts?
select * from agg_host order by count desc limit 10;
... and so on.
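Drill-through works the same way, only against the fact partitions
instead of the aggregates; the table name below is hypothetical, so
check sql/squidlog.sql for the actual schema:

Largest requests on a given day?
select *
  from facts_20130101_get
 order by bytes desc
 limit 100;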
Analytics UI
Start the Pentaho BI server by running biserver-ce/start-pentaho.sh.
Once started, go to localhost:8080 and log in as 'joe' with password
'password' (the Pentaho defaults). Open SquidLog/squidlog.xaction and
click as much as you like.
For all other information on Pentaho, start at http://community.pentaho.com/.
Here's a screenshot: