
Sueetie Analytics - Log Filtering

Modified on 2010/09/21 16:17 by daveburke. Categorized as Feature Rich.
A Sueetie Analytics enhancement that keeps unwanted urls and user agents from being logged and reported on.

A Clean Analytics Log is a Happy Analytics Log

Sueetie Analytics Logging had been online for less than 24 hours when I knew we needed additional data and controls to ensure more accurate Analytics Reports. Bottom line, the logging logic needed to be more intelligent. We achieved that by logging the User Agent and Remote IP of each request, and by creating a url filter file to prevent unwanted urls from being logged.

User Agent Filtering

Sueetie Analytics are called USER Analytics, not crawler analytics. We do not want to report on crawler page loads, so we need to prevent them from being logged. We could always clean up afterward with a Sueetie Background Task, but it's more efficient to keep crawler activity out of the log in the first place.

There is no static, defined list of crawler agents, so we're using a new SueetieConfiguration Core CrawlerAgents property to manage our crawler agent list, which will change over time. Here's what the initial list looks like in the Sueetie.config file. Expect it to change in short order.

 CrawlerAgents="(Reeder|msnbot|Googlebot|Baiduspider|ScrapeBox)"

Before logging a page request, we match the request's User Agent against the regular expression stored in the Sueetie Config CrawlerAgents value. If the pattern matches, the agent is treated as a web crawler and the page load is not logged.
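
To make the check concrete, here is a minimal sketch of the idea. Sueetie itself is a .NET application, so this Python version is only a stand-in for the logic, not the actual Sueetie code. The function name `is_crawler` is hypothetical, and treating a missing User Agent as "not a crawler" is an assumption. The article does not say whether the real match ignores case, so this sketch is case-sensitive.

```python
import re

# Pattern copied from the CrawlerAgents value shown in Sueetie.config above.
CRAWLER_AGENTS = "(Reeder|msnbot|Googlebot|Baiduspider|ScrapeBox)"
_crawler_re = re.compile(CRAWLER_AGENTS)


def is_crawler(user_agent):
    """Return True when the User Agent matches the crawler pattern.

    Hypothetical helper: a missing or empty User Agent is treated as
    a non-crawler here, which is an assumption of this sketch.
    """
    return bool(user_agent) and _crawler_re.search(user_agent) is not None


if __name__ == "__main__":
    googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    browser = "Mozilla/5.0 (Windows NT 6.1) Firefox/3.6"
    for agent in (googlebot, browser):
        action = "skip" if is_crawler(agent) else "log"
        print(action, "-", agent)
```

Because the pattern lives in configuration, adding a newly spotted crawler only means appending another alternative to the list; no code change is needed.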

Url Filtering

There are certain application files we do not want to include in our Analytics Logs, such as the auto-refresh page that ScrewTurn Wiki requests while a user is editing a page. To manage which urls we do not wish to log, we've added a NoLog.config file to the Sueetie /util/config directory. Here is the initial NoLog.config file, which filters the wiki refresh page.


<?xml version="1.0"?>
<nolog>

    <!-- Enter a lowercase string for which Request.RawUrl.ToLowerInvariant().Contains(uniquePathExcerpt) returns true -->

    <!-- wiki -->

    <url name="wiki_refresh"  uniquePathExcerpt="sessionrefresh.aspx" />

</nolog>
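
The comment in the file describes the test applied to each request: lowercase the raw url and check whether it contains the excerpt. A minimal sketch of that check follows, again in Python as a stand-in for the .NET code. The function names `load_excerpts` and `should_skip_url` are hypothetical, and the config is embedded as a string here only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the NoLog.config shown above, inlined for the example.
NOLOG_XML = """<?xml version="1.0"?>
<nolog>
    <url name="wiki_refresh" uniquePathExcerpt="sessionrefresh.aspx" />
</nolog>"""


def load_excerpts(xml_text):
    """Collect the uniquePathExcerpt value of every <url> entry."""
    root = ET.fromstring(xml_text)
    return [
        url.get("uniquePathExcerpt")
        for url in root.findall("url")
        if url.get("uniquePathExcerpt")
    ]


def should_skip_url(raw_url, excerpts):
    """Mirror of RawUrl.ToLowerInvariant().Contains(excerpt) for each entry."""
    lowered = raw_url.lower()
    return any(excerpt in lowered for excerpt in excerpts)


if __name__ == "__main__":
    excerpts = load_excerpts(NOLOG_XML)
    for url in ("/wiki/SessionRefresh.aspx?Page=Home", "/wiki/MainPage.ashx"):
        print("skip" if should_skip_url(url, excerpts) else "log", "-", url)
```

Since only the request url is lowercased, an excerpt containing uppercase letters would never match, which is why entries in the file should be written in lowercase.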


Additional Data Now Logged

To learn more about site traffic, both human and machine, we are logging two additional request properties: User Agent and Remote IP address. We log the User Agent so we can learn what is hitting our site and what we need to add to our CrawlerAgents string. We log the Remote IP address because we need to know the origin of suspicious behavior so we can take steps to prevent future attacks. Not yet announced is a Sueetie Add-on Pack (online at Sueetie.com) which includes managed IP blocking. The IP addresses gathered here can be added to it to prevent site access.

Those of you who know Sueetie intimately know that one of its core principles is a small footprint, which includes keeping the database small. You may think logging the User Agent and IP address on each request breaks a major Sueetie Rule. I agree, so we've taken that into consideration in the design of the logging tables. We created a new table called Sueetie_RequestLog which stores the User Agent and IP of each request. A Guid serves as the key between this new RequestLog table and the ReportLog table. We use the Request Log for monitoring site activity and tweaking filtering, but it is not used in Analytics Reporting, so we can create a simple Sueetie Background Task to truncate this table periodically and keep our database size as small as possible.
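
The two-table split can be illustrated with a small sketch. It uses an in-memory SQLite database in Python purely for demonstration; Sueetie's actual schema is not shown in this article, so every column name below (`RequestGuid`, `Url`, `UserAgent`, `RemoteIP`) is hypothetical, as are the function names. SQLite has no TRUNCATE statement, so the periodic cleanup is shown as a DELETE.

```python
import sqlite3
import uuid


def create_tables(conn):
    """Create simplified stand-ins for the two logging tables."""
    # Column names are assumptions; only the table roles come from the article.
    conn.execute("CREATE TABLE ReportLog (RequestGuid TEXT PRIMARY KEY, Url TEXT)")
    conn.execute(
        "CREATE TABLE Sueetie_RequestLog "
        "(RequestGuid TEXT, UserAgent TEXT, RemoteIP TEXT)"
    )


def log_request(conn, url, user_agent, remote_ip):
    """Write one request to both tables, linked by a shared Guid."""
    request_guid = str(uuid.uuid4())
    conn.execute("INSERT INTO ReportLog VALUES (?, ?)", (request_guid, url))
    conn.execute(
        "INSERT INTO Sueetie_RequestLog VALUES (?, ?, ?)",
        (request_guid, user_agent, remote_ip),
    )
    return request_guid


def purge_request_log(conn):
    """Empty the request log; the reporting table is left untouched."""
    conn.execute("DELETE FROM Sueetie_RequestLog")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_tables(conn)
    log_request(conn, "/blog/default.aspx", "Mozilla/5.0", "203.0.113.7")
    purge_request_log(conn)
    print(conn.execute("SELECT COUNT(*) FROM ReportLog").fetchone()[0])
    print(conn.execute("SELECT COUNT(*) FROM Sueetie_RequestLog").fetchone()[0])
```

The point of the design is visible in the last step: the bulky per-request details can be thrown away on a schedule without affecting the rows that reporting depends on.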

New and Improved Results

With url and user agent filtering we ensure that our database stays small and our analytics reports are clean and accurate. By using the Sueetie Configuration "CrawlerAgents" property and NoLog.config file, we can manage Sueetie Analytics Log Filtering with ease.


Copyright © 2008-2012 Sueetie LLC. All rights reserved.