| Path: | bin/craigwatch |
| Last Update: | Sat Feb 05 20:54:33 +0000 2011 |
Created alongside the libcraigscrape library, craigwatch was designed to take the monotony out of regular craigslist monitoring. It is meant to be run at periodic intervals (hourly, daily, etc.) through crontab, and it reports by email all postings that have appeared in a listing or search URL since its last run.
For more information, head to the craigslist monitoring help section of our website.
In addition to its report tracking, craigwatch offers many post search and filtering options that give much improved and more accurate results than craigslist's own search functions. Post filtering options include:
Multiple searches can be combined into a single report, and results can be sorted newest-first or oldest-first (the default).
Report output is easily customized HTML, handled by ActionMailer, and emails can be delivered via SMTP or sendmail. Database tracking of already-delivered posts is handled by ActiveRecord, and its driver-agnostic SQL supports all the major backends (sqlite/mysql/postgres/probably-all-others). Database sizes are kept in check by automatically pruning, at the end of each run, old results that are no longer required.
Pretty useful, no?
craigwatch is coupled with libcraigscrape and is installed via RubyGems. However, since we focused on keeping the libcraigscrape download 'lightweight', some additional gems need to be installed alongside the libcraigscrape gem itself.
This should take care of the craigwatch install on all systems:
sudo gem install libcraigscrape kwalify activerecord actionmailer
Alternatively, if you've already installed libcraigscrape and want to start working with craigwatch:
sudo gem install kwalify activerecord actionmailer
This script was initially developed with activerecord 2.3, actionmailer 2.3 and kwalify 0.7, but will likely work with most prior and future versions of these libraries.
When craigwatch is invoked, it runs a single report and then terminates. It takes exactly one parameter: the path to a valid report-definition YAML (.yml) file. For example:
craigwatch johns_daily_watch.yml
An included kwalify schema can be used to validate your definition files, though craigwatch also validates them automatically at startup. The best way to understand the report definition files is probably to look at the annotated sample files below and use them as a starting point for your own.
By default there is no program output. However, setting any of the following parameters to 'yes' in your definition file will turn on useful debugging/logging output:
Let's start with a minimal report, just enough to get something working quickly:
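Before handing a definition file to craigwatch, it can help to confirm that it at least parses as YAML and has the top-level shape the samples use. The following is a minimal sketch using only Ruby's standard library. It is not the bundled kwalify schema, and the `missing_keys` helper and its list of required keys are assumptions made for illustration.

```ruby
require "yaml"

# Hypothetical minimal definition, modelled on the sample files in this document.
definition_text = <<~DEFINITION
  email_to: Johnathan Peabody <john@example.local>
  searches:
    - name: Schwinn Bikes For Sale in/near New York
      sites: [ us/ny/newyork ]
      listings: [ bik ]
DEFINITION

# Assumed-required top-level keys; the real kwalify schema may differ.
def missing_keys(definition, required = %w[email_to searches])
  required.reject { |key| definition.key?(key) }
end

definition = YAML.safe_load(definition_text)

puts "missing keys: #{missing_keys(definition).inspect}"
definition["searches"].each { |search| puts "search: #{search['name']}" }
```

A check like this only catches gross structural mistakes; craigwatch's own startup validation remains the authority on what a valid definition is.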
# We need some kind of destination to send this to
email_to: Chris DeRose <cderose@derosetechnologies.com>
# This is an array of specific 'searches' we'll be performing in this report:
searches:
  # We're looking for a 90's-era Cadillac, something cheap, comfortable and in white...
  - name: 90's White/Creme Convertible Cadillacs
    # This starting date is mostly for the first run, and gives us a reasonable cut-off point from which to build.
    # It's optional, and if omitted, craigwatch defaults to 'yesterday'
    starting: 9/10/09
    # We want to check all the labels, and filter out years not in the 90's, and cars not made by Cadillac
    summary_post_has:
      - /(?:^|[^\d]|19)9[\d](?:[^\dk]|$)/i
      - /cadillac/i
    # I said we're looking for something *comfortable* !
    summary_post_has_no: [ /xlr/i ]
    # We want a convertible, and white/cream/etc:
    full_post_has:
      - /convertible/i
      - /(white|yellow|banana|creme|cream)/i
    # Convertible - not *simulated* convertible!
    full_post_has_no:
      - /simulated[^a-z]{0,2}convertible/i
    # We want to search all of the craigslist sites in the US, and we'll want to find it using
    # the '/search/cta?hasPic=1&query=cadillac' url on the site
    sites: [ us ]
    listings:
      - /search/cta?hasPic=1&query=cadillac
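To see what the summary patterns in this sample accept and reject, they can be tried against a few made-up post titles. The sketch below is illustrative only: `parse_pattern` and `passes?` are hypothetical helpers, not part of craigwatch, and the assumed semantics (every `has` pattern must match, no `has_no` pattern may match) are inferred from the comments in the sample rather than taken from the craigwatch source.

```ruby
# Hypothetical helper: turn a "/pattern/i" string, as written in the YAML file,
# into a Regexp. Only the "i" flag is handled, since it is the only one the samples use.
def parse_pattern(text)
  match = text.match(%r{\A/(.*)/(i?)\z})
  raise ArgumentError, "not a /regex/ literal: #{text}" unless match
  Regexp.new(match[1], match[2] == "i" ? Regexp::IGNORECASE : 0)
end

# Assumed semantics: all "has" patterns must match and no "has_no" pattern may match.
def passes?(text, has: [], has_no: [])
  has.all? { |re| text =~ re } && has_no.none? { |re| text =~ re }
end

SUMMARY_HAS    = ['/(?:^|[^\d]|19)9[\d](?:[^\dk]|$)/i', '/cadillac/i'].map { |s| parse_pattern(s) }
SUMMARY_HAS_NO = ['/xlr/i'].map { |s| parse_pattern(s) }

# Made-up titles for illustration.
titles = [
  "1994 Cadillac Eldorado convertible", # 90's year, Cadillac, no XLR
  "95 cadillac deville",                # two-digit year at the start of the line
  "1995 Cadillac XLR",                  # rejected by summary_post_has_no
  "Cadillac, 99k miles",                # "99k" is mileage, not a year
]

titles.each do |title|
  verdict = passes?(title, has: SUMMARY_HAS, has_no: SUMMARY_HAS_NO) ? "keep" : "drop"
  puts "#{verdict}  #{title}"
end
```

The trailing `[^\dk]` in the year pattern is what keeps a mileage figure such as "99k" from being mistaken for a model year.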
Here's another annotated report which uses most of the other available craigwatch features:
# The report_name is fed into Time.now.strftime, hence the formatting characters
report_name: Craig Watch For Johnathan on %D at %I:%M %p
email_to: Johnathan Peabody <john@example.local>
# This is sent straight into ActiveRecord, so there are plenty of options available here. The following is an easy
# default sqlite store that should work on almost any system with minimal overhead
tracking_database: { adapter: sqlite3, dbfile: /home/john/john_cwatch_report.db }
searches:
  # Search #1:
  - name: Schwinn Bikes For Sale in/near New York
    starting: 9/10/2009
    # Scrape the following sites/servers:
    sites: [ us/ny/newyork, us/nj/southjersey ]
    # Scrape the following listings pages:
    listings: [ bik ]
    # We want listings with Schwinn in the summary
    summary_post_has: [ /schwinn/i ]
    # We're only interested in adult bikes, so scrap any results that mention children or kids
    full_post_has_no: [ /(children|kids)/i ]
    # Oh, and we're on a budget:
    price_less_than: 120
  # Search #2
  - name: Large apartment rentals in San Francisco
    sites: [ us/ca/sfbay ]
    starting: 9/10/2009
    # We're going to rely on craigslist's built-in search for this one since there are a lot of listings, and we
    # want to conserve some bandwidth
    listings: [ /search/apa?query=pool&minAsk=min&maxAsk=max&bedrooms=5 ]
    # We'll require a price to be listed, 'cause it keeps out some of the unwanted fluff
    price_required: yes
    # Hopefully this will keep us away from a bad part of town:
    price_greater_than: 1000
    # Since we don't have time to drive to each location, we'll require only listings with pictures
    has_image: yes
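As the comment on `report_name` notes, the name is passed through `Time#strftime`, so the `%` sequences are replaced with date and time fields when the report runs. The short sketch below shows how the sample template expands. It uses a fixed, arbitrarily chosen time so the result is reproducible, whereas craigwatch is described as using `Time.now`.

```ruby
# The report_name template from the sample definition.
template = "Craig Watch For Johnathan on %D at %I:%M %p"

# A fixed time (an assumption for this example) instead of Time.now.
run_time = Time.utc(2011, 2, 5, 20, 54)

# %D expands to %m/%d/%y, %I is the 12-hour clock hour, %M the minute, %p AM or PM.
report_title = run_time.strftime(template)
puts report_title
# prints "Craig Watch For Johnathan on 02/05/11 at 08:54 PM"
```

Any literal percent sign wanted in a report name would need to be written as `%%`, since `strftime` treats a single `%` as the start of a format sequence.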
See COPYING