craigwatch

Path: bin/craigwatch
Last Update: Sat Feb 05 20:54:33 +0000 2011

craigwatch - An email-based "post monitoring" solution

Created alongside the libcraigscrape library, craigwatch was designed to take the monotony out of regular craigslist monitoring. craigwatch is designed to be run at periodic intervals (hourly/daily/etc.) through crontab, and to report, by email, all new postings within a listing or search url since its last run.
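A typical deployment schedules craigwatch from cron. As an illustration (the paths below are hypothetical), an hourly run might look like:

```
# min hour dom mon dow  command
0     *    *   *   *    /usr/bin/craigwatch /home/john/johns_daily_watch.yml
```

Since craigwatch tracks already-delivered posts in its database, each run only reports postings that are new since the previous run.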

For more information, head to the craigslist monitoring help section of our website.

Features

In addition to its report tracking, craigwatch offers many post search and filtering options that deliver much improved and more accurate results than craigslist's own search functions. Post filtering options include:

  • has_image - yes/no
  • price_required - yes/no
  • price_greater_than - (int)
  • price_less_than - (int)
  • full_post_has - (array of string or regexp) Only include posts whose full-post contents contain/match
  • full_post_has_no - (array of string or regexp) Only include posts whose full-post contents don't contain/match
  • summary_post_has - (array of string or regexp) Only include posts whose listing label contains/matches
  • summary_post_has_no - (array of string or regexp) Only include posts whose listing label doesn't contain/match
  • summary_or_full_post_has - (array of string or regexp) Filters out results which match neither the post label nor the post contents
  • summary_or_full_post_has_no - (array of string or regexp) Filters out results which match either the post label or the post contents
  • location_has - (array of string or regexp) Only include posts which match against the post location
  • location_has_no - (array of string or regexp) Only include posts which don‘t match against the post location
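Each of these filters accepts plain strings (matched as case-insensitive substrings) or regexps. A simplified sketch of how such a filter could be applied — this is an illustration, not craigwatch's actual implementation:

```ruby
# Match a post's text against an array of strings and/or Regexps.
# Strings are treated as case-insensitive substrings; Regexps match directly.
def matches_any?(text, terms)
  terms.any? do |term|
    term.is_a?(Regexp) ? !!(text =~ term) : text.downcase.include?(term.downcase)
  end
end

post_label = "1992 Cadillac Eldorado Convertible - $3500"

# A summary_post_has-style filter keeps the post only if every term matches:
filter = [/cadillac/i, "convertible"]
keep = filter.all? { |term| matches_any?(post_label, [term]) }
puts keep  # => true
```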

Multiple searches can be combined into a single report, and results can be sorted newest-first or oldest-first (the default).

Reporting output is easily-customized HTML, handled by ActionMailer, and emails can be delivered via SMTP or sendmail. Database tracking of already-delivered posts is handled by ActiveRecord, whose driver-agnostic SQL supports all the major backends (SQLite/MySQL/PostgreSQL, and probably all others). Database size is contained by automatically pruning old results that are no longer required at the end of each run.

Pretty useful, no?

Installation

craigwatch is coupled with libcraigscrape, and is installed via RubyGems. However, since we focused on keeping the libcraigscrape download 'lightweight', a few additional gems need to be installed alongside the libcraigscrape gem itself.

This should take care of the craigwatch install on all systems:

   sudo gem install libcraigscrape kwalify activerecord actionmailer

Alternatively, if you've already installed libcraigscrape and want to start working with craigwatch:

   sudo gem install kwalify activerecord actionmailer

This script was initially developed with activerecord 2.3, actionmailer 2.3 and kwalify 0.7, but will likely work with most prior and future versions of these libraries.

Usage

When invoked, craigwatch runs a single report and then terminates. It takes only one parameter: the path to a valid report-definition yml file. e.g.:

   craigwatch johns_daily_watch.yml

There is an included kwalify schema which can validate your definition files, though craigwatch will automatically do so at startup. Probably the best way to understand the report definition files is to look at the annotated samples below and use them as a starting point for your own.

By default there is no program output; however, setting any of the following parameters to 'yes' in your definition file will turn on useful debugging/logging output:

  • debug_database
  • debug_mailer
  • debug_craigscrape

Definition File Sample

Let's start with a minimal report, just enough to get something working quickly:

   # We need some kind of destination to send this to
   email_to: Chris DeRose <cderose@derosetechnologies.com>

   # This is an array of specific 'searches' we'll be performing in this report:
   searches:
        # We're looking for a 90's era Cadillac, something cheap, comfortable and in white...
      - name: 90's White/Creme Convertible Cadillacs

        # This starting date is mostly for the first run, and gives us a reasonable cut-off point from which to build.
        # It's optional, and if omitted, craigwatch defaults to 'yesterday'
        starting: 9/10/09

        # We want to check all the labels, and filter out years not in the 90's, and cars not made by cadillac
        summary_post_has:
           - /(?:^|[^\d]|19)9[\d](?:[^\dk]|$)/i
           - /cadillac/i

        # I said we're looking for something *comfortable* !
        summary_post_has_no: [ /xlr/i ]

        # We want a convertible, and white/cream/etc:
        full_post_has:
           - /convertible/i
           - /(white|yellow|banana|creme|cream)/i

        # Convertible - not *simulated* convertible!
        full_post_has_no:
           - /simulated[^a-z]{0,2}convertible/i

        # We want to search all of craigslist's US sites, and we'll find listings using
        # the '/search/cta?hasPic=1&query=cadillac' url on each site
        sites: [ us ]
        listings:
           - /search/cta?hasPic=1&query=cadillac
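The first summary_post_has regexp above is dense enough to deserve a quick illustration. It matches a 90's model year ("90"-"99" or "1990"-"1999") while avoiding false hits like "90k" prices. The sample titles here are invented:

```ruby
# The model-year regexp from the sample report above:
year = /(?:^|[^\d]|19)9[\d](?:[^\dk]|$)/i

puts !!("1992 Cadillac Eldorado" =~ year)  # => true  ("19" followed by "92")
puts !!("94 cadillac deville"    =~ year)  # => true  (title starts with "94")
puts !!("1985 Cadillac"          =~ year)  # => false (an 80's year)
puts !!("$90k Cadillac"          =~ year)  # => false ("k" follows the "90")
```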

Here‘s another annotated report which uses most of the other available craigwatch features:

   # The report_name is fed into Time.now.strftime, hence the formatting characters
   report_name: Craig Watch For Johnathan on %D at %I:%M %p

   email_to: Johnathan Peabody <john@example.local>

   # This is sent straight into ActiveRecord, so there's plenty of options available here. the following is an easy
   # default sqlite store that should work on most any system with a minimal overhead
   tracking_database: { adapter: sqlite3, dbfile: /home/john/john_cwatch_report.db }

   searches:
      # Search #1:
      - name: Schwinn Bikes For Sale in/near New York
        starting: 9/10/2009

        # Scrape the following sites/servers:
        sites: [ us/ny/newyork, us/nj/southjersey ]

        # Scrape the following listings pages:
        listings: [ bik ]

        # We want listings with Schwinn in the summary
        summary_post_has: [ /schwinn/i ]

        # We're only interested in adult bikes, so scrap any results that mention children or kids
        full_post_has_no: [ /(children|kids)/i ]

        # Oh, and we're on a budget:
        price_less_than: 120

      # Search #2
      - name: Large apartment rentals in San Francisco
        sites: [ us/ca/sfbay ]
        starting: 9/10/2009

        # We're going to rely on craigslist's built-in search for this one since there's a lot of listings, and we
        # want to conserve some bandwidth
        listings: [ /search/apa?query=pool&minAsk=min&maxAsk=max&bedrooms=5 ]

        # We'll require a price to be listed, 'cause it keeps out some of the unwanted fluff
        price_required: yes

        # Hopefully this will keep us away from a bad part of town:
        price_greater_than: 1000

        # Since we don't have time to drive to each location, we'll require only listings with pictures
        has_image: yes
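Since the report_name string is fed through Time.now.strftime, format codes like %D and %I:%M %p expand at run time. A quick Ruby illustration, using a fixed time instead of Time.now:

```ruby
# The report_name from the sample above, expanded with strftime.
t = Time.local(2009, 9, 10, 14, 5)
puts t.strftime("Craig Watch For Johnathan on %D at %I:%M %p")
# => "Craig Watch For Johnathan on 09/10/09 at 02:05 PM"
```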

Author

License

See COPYING

Required files

rubygems, kwalify, active_record, action_mailer, kwalify/util/hashlike, libcraigscrape, socket
