CHANGELOG

Path: CHANGELOG
Last Update: Sat Feb 05 20:54:33 +0000 2011

Change Log

Release 0.9.1

  • Added support for posting_has_expired? and expired post recognition
  • Fixed a weird bug in craigwatch that would cause a scrape to abort if a flagged_for_removal? was encountered when using certain (minimal) filtering

Release 0.9 (Oct 01, 2010)

  • Minor adjustments to craigwatch to fix deprecation warnings in new ActiveRecord and ActionMailer gems
  • Added gem version specifiers to the Gem spec and to the require statements
  • Moved repo to github
  • Fixed an esoteric bug in craigwatch, affecting the last scraped post in a listing when that post was ‘flagged for removal’.
  • Took all those extra package-building tasts out of the Rakefile since this is 2010 and we only party with gemfiles
  • Ruby 1.9 compatibility adjustments

Release 0.8.4 (Sep 6, 2010)

  • Someone found a way to screw up hpricot‘s to_s method (posting1938291834-090610.html) and fixed by added html_source to the craigslist Scraper object, which returns the body of the post without passing it through hpricot. Its a better way to go anyways, and re-wrote a couple incidentals to use the html_source method…
  • Adjusted the test cases a bit, since the user bodies being returned have less cleanup in their output than they had prior

Release 0.8.3 (August 2, 2010)

  • Someone was posting really bad html that was screwing up Hpricot. Such is to be expected when you‘re soliciting html from the general public I suppose. Added test_bugs_found061710 posting test, and fixed by stripping out the user body before parsing with Hpricot.
  • Added a MaxRedirectError and corresponding maximum_redirects_per_request cattr for the Craigscrape objects. This fixed a weird bug where craigslist was sending us in redirect circles around 06/10

Release 0.8.2 (April 17, 2010)

  • Found another odd parsing bug. Scrape sample is in ‘listing_samples/mia_search_kitten.3.15.10.html’, Adjusted CraigScrape::Listings::HEADER_DATE to fix.
  • Craigslist started added <span> tags in its post summaries. Fixed. See sample in test_new_listing_span051710

Release 0.8.1 (Feb 10, 2010)

  • Found an odd parsing bug occured for the first time today. Scrape sample is in ‘listing_samples/mia_sss_kittens2.10.10.html’, Adjusted CraigScrape::Listings::LABEL to fix.
  • Switched to require "active_support" per the deprecation notices
  • Little adjustments to fix the rdoc readibility

Release 0.8.0 (Oct 22, 2009)

  • Lots of substantial changes to the API & craigwatch (though backwards compatibility is mostly there)
  • Added :code_tests to the rakefile
  • Report definitions don‘t need a full path to the :dbfile, the parameter here can now be relative to the yaml file itself
  • Added a Listings::next_page
  • craigwatch: When not specifying a regex in _has or _has_no, we now perform an insensitive search
  • Created CraigScrape::GeoListings.find_sites, & CraigScrape::GeoListings.sites_in_path methods
  • Large API changes Added a constructor to CraigScrape, and changed a number of ways that sites are scraped
  • Changed the format of the craigwatch tracking db - you‘ll need to delete any db‘s you already have and let the migrations re-run
  • craigwatch is much more efficient with memory. Feel free to scrape the whole world now!
  • craigwatch‘s yml changed a bit - documented in craigwatch
  • We‘ll more or less automatically figure out the tracking_database in craigwatch if none is specified (will default to sqlite and auto-generated filename)
  • craigwatch report_name is optional too now and can largely figure itself out
  • Added summary_or_full_post_has and summary_or_full_post_has_no as craigwatch report parameters
  • If a craigwatch search comes up empty - we now indicate that no results were found…
  • Added location_has, location_has_no to craigwatch
  • Cleaned up thge rdoc to clarify all the new syntax/features
  • Added Scraper::retries_on_404_fail, Scraper::sleep_between_404_retries to help deal with some of the subtleties in handling connection reset errors different than the 404‘s

Release 0.7.0 (Jul 5, 2009)

  • A good bit of refactoring
  • Eager-loading in the Post object without the need of the full_post method
  • full_post is no longer needed or available
  • Added a Base Scraper object. Maybe I‘ll make that its own gem…
  • Post/Listing constructors now take either a Hash or a url (string) to use for its scrape. Regardless, either should always have a url object association now
  • Removed the PostSummary object, now we just have Posting/ Posts include all the functionality of both PostFull and PostSummary. Be careful with Posting if you‘re trying to minimize bandwidth. the rdoc labels which methods wont/will/might cause a page load.
  • Things should fail better (always outputting relevant html that was unable to be parsed)
  • Removed Posting::date in favor of Posting::post_date
  • Posting::full_url is now url
  • Post::href is no longer needed. Use Post::uri.path instead
  • Posting.images renamed to Posting.pics, and a new method Posting.images better reflects the craigslist designations
  • Fixed a bug with some search listings, where the last page might not get scrapped b/c craigslist doesn‘t correctly include the next page link. see ‘test_nasty_search_listings’ for an example
  • On some listings (with the empty h4 tags) we were eager-loading when we could have made some safe assumptions instead. Fixed.
  • Preliminary support for GeoListing scrapes.. (Not exactly sure where this will go - but I have ideas..)
  • Adjusted Posting::images and Posting::pics to better match the craigslist media-dicotomy notation.

Release 0.6.5 (Jun 8, 2009)

  • Added PostFull::deleted_by_author? , added test case for said condition
  • Fixed a bug that caused the library to die in weird ways if there wasn‘t a title tag on a parsed page
  • Apparently Craigslist starting gzip-encoding some listings. Added gzip decoding support
  • Found a bug when parsing the location field on some full_posts in the apa sections
  • Added support for file:// uri‘s int he scrape_* functions, and revised the tests to use these uri‘s
  • Fixed a bug that caused errors to be raised with legitimately empty listing pages

Release 0.6.0 (May 21, 2009)

  • Added PostFull::flagged_for_removal?
  • Fixed a couple small parse bugs found in production
  • Added a ParseError object, which is raised on … detected Parse errors. This raise will output the buggy parse, please send this over to me if you think CraigsScrape is actually wrong ..
  • Adjusted the examples in the readme, added a "require ‘rubygems’" to the top of the listing so that they would actually work if you tried to run them verbatim (Thanks J T!)
  • Restructured some of the parsing to be less leinient when scraped values aren‘t matching their regexp‘s in the PostSummary
  • It seems like craigslist returns a 404 on pages that exist, for no good reason on occasion. Added a retry mechanism that wont take no for an answer, unless we get a defineable number of them in a row
  • Added CraigScrape cattr_accessors : retries_on_fetch_fail, sleep_between_fetch_retries .
  • Adjusted craigwatch to not commit any database changes until the notification email goes out. This way if there‘s an error, the user wont miss any results on a re-run
  • Added a FetchError for http requests that don‘t return 200 or redirect…
  • Adjusted craigwatch to use scrape_until instead of scrape_since, this new approach cuts down on the url fetching by assuming that if we come across something we‘ve already tracked, we dont need to keep going any further. NOTE: We still can‘t use a ‘last_scraped_url’ on the TrackedSearch model b/c sometimes posts get deleted.
  • We‘re tracking all date-relevant posts now in craigwatch, not just the ones that matched the first time aorund, this should significantly speed up our scrapes
  • Found an example of a screwy listing that was starting date h4‘s , but not actually listing the dates in them (see mia_fua_index8900.5.21.09.html test case). Added test case - handled appropriately. Seems like a bug in craigslist itself.

Release 0.5.0 (April 30, 2009)

  • First release. Not much to say - hopefully someone else finds this useful.

[Validate]