Change Log
Release 0.9.1
- Added support for posting_has_expired? and expired post recognition
- Fixed a weird bug in craigwatch that would cause a scrape to abort if a
flagged_for_removal? was encountered when using certain (minimal) filtering
Release 0.9 (Oct 01, 2010)
- Minor adjustments to craigwatch to fix deprecation warnings in new
ActiveRecord and ActionMailer gems
- Added gem version specifiers to the Gem spec and to the require statements
- Moved repo to github
- Fixed an esoteric bug in craigwatch, affecting the last scraped post in a
listing when that post was ‘flagged for removal’.
- Took all those extra package-building tasts out of the Rakefile since this
is 2010 and we only party with gemfiles
- Ruby 1.9 compatibility adjustments
Release 0.8.4 (Sep 6, 2010)
- Someone found a way to screw up hpricot‘s to_s method
(posting1938291834-090610.html) and fixed by added html_source to the
craigslist Scraper object, which returns the body of the post without
passing it through hpricot. Its a better way to go anyways, and re-wrote a
couple incidentals to use the html_source method…
- Adjusted the test cases a bit, since the user bodies being returned have
less cleanup in their output than they had prior
Release 0.8.3 (August 2, 2010)
- Someone was posting really bad html that was screwing up Hpricot. Such is
to be expected when you‘re soliciting html from the general public I
suppose. Added test_bugs_found061710 posting test, and fixed by stripping
out the user body before parsing with Hpricot.
- Added a MaxRedirectError and corresponding maximum_redirects_per_request
cattr for the Craigscrape objects. This fixed a weird bug where craigslist
was sending us in redirect circles around 06/10
Release 0.8.2 (April 17, 2010)
- Found another odd parsing bug. Scrape sample is in
‘listing_samples/mia_search_kitten.3.15.10.html’, Adjusted
CraigScrape::Listings::HEADER_DATE to fix.
- Craigslist started added <span> tags in its post summaries. Fixed.
See sample in test_new_listing_span051710
Release 0.8.1 (Feb 10, 2010)
- Found an odd parsing bug occured for the first time today. Scrape sample is
in ‘listing_samples/mia_sss_kittens2.10.10.html’, Adjusted
CraigScrape::Listings::LABEL to fix.
- Switched to require "active_support" per the deprecation notices
- Little adjustments to fix the rdoc readibility
Release 0.8.0 (Oct 22, 2009)
- Lots of substantial changes to the API & craigwatch (though backwards
compatibility is mostly there)
- Added :code_tests to the rakefile
- Report definitions don‘t need a full path to the :dbfile, the
parameter here can now be relative to the yaml file itself
- Added a Listings::next_page
- craigwatch: When not specifying a regex in _has or _has_no, we now perform
an insensitive search
- Created CraigScrape::GeoListings.find_sites, &
CraigScrape::GeoListings.sites_in_path methods
- Large API changes Added a constructor to CraigScrape, and changed a number of
ways that sites are scraped
- Changed the format of the craigwatch tracking db - you‘ll need to
delete any db‘s you already have and let the migrations re-run
- craigwatch is much more efficient with memory. Feel free to scrape
the whole world now!
- craigwatch‘s yml changed a bit - documented in craigwatch
- We‘ll more or less automatically figure out the tracking_database in
craigwatch if none is specified (will default to sqlite and auto-generated
filename)
- craigwatch report_name is optional too now and can largely figure itself
out
- Added summary_or_full_post_has and summary_or_full_post_has_no as
craigwatch report parameters
- If a craigwatch search comes up empty - we now indicate that no results
were found…
- Added location_has, location_has_no to craigwatch
- Cleaned up thge rdoc to clarify all the new syntax/features
- Added Scraper::retries_on_404_fail, Scraper::sleep_between_404_retries to
help deal with some of the subtleties in handling connection reset errors
different than the 404‘s
Release 0.7.0 (Jul 5, 2009)
- A good bit of refactoring
- Eager-loading in the Post object without the need of the full_post method
- full_post is no longer needed or available
- Added a Base Scraper object. Maybe I‘ll make that its own gem…
- Post/Listing constructors now take either a Hash or a url (string) to use
for its scrape. Regardless, either should always have a url object
association now
- Removed the PostSummary object, now we just have Posting/ Posts include all
the functionality of both PostFull and PostSummary. Be careful with Posting
if you‘re trying to minimize bandwidth. the rdoc labels which methods
wont/will/might cause a page load.
- Things should fail better (always outputting relevant html that was unable
to be parsed)
- Removed Posting::date in favor of Posting::post_date
- Posting::full_url is now url
- Post::href is no longer needed. Use Post::uri.path instead
- Posting.images renamed to Posting.pics, and a new method Posting.images
better reflects the craigslist designations
- Fixed a bug with some search listings, where the last page might not get
scrapped b/c craigslist doesn‘t correctly include the next page link.
see ‘test_nasty_search_listings’ for an example
- On some listings (with the empty h4 tags) we were eager-loading when we
could have made some safe assumptions instead. Fixed.
- Preliminary support for GeoListing scrapes.. (Not exactly sure where this
will go - but I have ideas..)
- Adjusted Posting::images and Posting::pics to better match the craigslist
media-dicotomy notation.
Release 0.6.5 (Jun 8, 2009)
- Added PostFull::deleted_by_author? , added test case for said condition
- Fixed a bug that caused the library to die in weird ways if there
wasn‘t a title tag on a parsed page
- Apparently Craigslist starting gzip-encoding some listings. Added
gzip decoding support
- Found a bug when parsing the location field on some full_posts in the apa
sections
- Added support for file:// uri‘s int he scrape_* functions, and
revised the tests to use these uri‘s
- Fixed a bug that caused errors to be raised with legitimately empty listing
pages
Release 0.6.0 (May 21, 2009)
- Added PostFull::flagged_for_removal?
- Fixed a couple small parse bugs found in production
- Added a ParseError object, which is raised on … detected Parse
errors. This raise will output the buggy parse, please send this over to me
if you think CraigsScrape is actually wrong ..
- Adjusted the examples in the readme, added a "require
‘rubygems’" to the top of the listing so that they would
actually work if you tried to run them verbatim (Thanks J T!)
- Restructured some of the parsing to be less leinient when scraped values
aren‘t matching their regexp‘s in the PostSummary
- It seems like craigslist returns a 404 on pages that exist, for no good
reason on occasion. Added a retry mechanism that wont take no for an
answer, unless we get a defineable number of them in a row
- Added CraigScrape cattr_accessors
: retries_on_fetch_fail, sleep_between_fetch_retries .
- Adjusted craigwatch to not commit any database changes until the
notification email goes out. This way if there‘s an error, the user
wont miss any results on a re-run
- Added a FetchError for http requests that don‘t return 200 or
redirect…
- Adjusted craigwatch to use scrape_until instead of scrape_since, this new
approach cuts down on the url fetching by assuming that if we come across
something we‘ve already tracked, we dont need to keep going any
further. NOTE: We still can‘t use a ‘last_scraped_url’ on
the TrackedSearch model b/c sometimes posts get deleted.
- We‘re tracking all date-relevant posts now in craigwatch, not just
the ones that matched the first time aorund, this should significantly
speed up our scrapes
- Found an example of a screwy listing that was starting date h4‘s ,
but not actually listing the dates in them (see
mia_fua_index8900.5.21.09.html test case). Added test case - handled
appropriately. Seems like a bug in craigslist itself.
Release 0.5.0 (April 30, 2009)
- First release. Not much to say - hopefully someone else finds this useful.