Class CraigScrape::GeoListings
In: lib/geo_listings.rb
Parent: Scraper

GeoListings represents a parsed Craigslist geo lisiting page. (i.e. 'http://geo.craigslist.org/iso/us') These list all the craigslist sites in a given region.

Methods

find_sites   location   new   sites   sites_in_path  

Constants

GEOLISTING_BASE_URL = %{http://geo.craigslist.org/iso/}
LOCATION_NAME = /[ ]*\>[ ](.+)[ ]*/
PATH_SCANNER = /(?:\\\/|[^\/])+/
URL_HOST_PART = /^[^\:]+\:\/\/([^\/]+)[\/]?$/
SITE_PREFIX = /^([^\.]+)/
FIND_SITES_PARTS = /^[ ]*([\+|\-]?)[ ]*(.+)[ ]*/

Public Class methods

find_sites takes a single array of strings as an argument. Each string is to be either a location path (see sites_in_path), or a full site (in canonical form - ie "memphis.craigslist.org"). Optionally, each of this may/should contain a ’+’ or ’-’ prefix to indicate whether the string is supposed to include sites from the master list, or remove them from the list. If no ’+’ or’-’ is specified, the default assumption is ’+’. Strings are processed from left to right, which gives a high degree of control over the selection set. Examples:

There‘s a lot of flexibility here, you get the idea.

[Source]

# File lib/geo_listings.rb, line 123
    def self.find_sites(specs, base_url = GEOLISTING_BASE_URL)
      ret = []
      
      specs.each do |spec|
        (op,spec = $1,$2) if FIND_SITES_PARTS.match spec

        spec = (spec.include? '.')  ? [spec] : sites_in_path(spec, base_url) 

        (op == '-') ? ret -= spec : ret |= spec
      end
      
      ret
    end

The geolisting constructor works like all other Scraper objects, in that it accepts a string ‘url’. See the Craigscrape.find_sites for a more powerful way to find craigslist sites.

[Source]

# File lib/geo_listings.rb, line 28
    def initialize(init_via = nil)
      super(init_via)

      # Validate that required fields are present, at least - if we've downloaded it from a url
      parse_error! unless location
    end

This method will return an array of all possible sites that match the specified location path. Sample location paths:

  • us/ca
  • us/fl/miami
  • jp/fukuoka
  • mx

Here‘s how location paths work.

  • The components of the path are to be separated by ’/’ ‘s.
  • Up to (and optionally, not including) the last component, the path should correspond against a valid GeoLocation url with the prefix of ‘geo.craigslist.org/iso/’
  • the last component can either be a site‘s ‘prefix’ on a GeoLocation page, or, the last component can just be a geolocation page itself, in which case all the sites on that page are selected.
  • the site prefix is the first dns record in a website listed on a GeoLocation page. (So, for the case of us/fl/miami , the last ‘miami’ corresponds to the ‘south florida’ link on 'http://geo.craigslist.org/iso/us/fl'

[Source]

# File lib/geo_listings.rb, line 70
    def self.sites_in_path(full_path, base_url = GEOLISTING_BASE_URL)
      # the base_url parameter is mostly so we can test this method
      
      # Unfortunately - the easiest way to understand much of this is to see how craigslist returns 
      # these geolocations. Watch what happens when you request us/fl/non-existant/page/here.
      # I also made this a little forgiving in a couple ways not specified with official support, per 
      # the rules above.
      full_path_parts = full_path.scan PATH_SCANNER

      # We'll either find a single site in this loop andf return that, or, we'll find a whole listing
      # and set the geo_listing object to reflect that
      geo_listing = nil
      full_path_parts.each_with_index do |part, i|

        # Let's un-escape the path-part, if needed:
        part.gsub! "\\/", "/"        

        # If they're specifying a single site, this will catch and return it immediately
        site = geo_listing.sites.find{ |n,s| 
          (SITE_PREFIX.match s and $1 == part) or n == part
        } if geo_listing

        # This returns the site component of the found array
        return [site.last] if site 

        begin
          # The URI escape is mostly needed to translate the space characters
          l = GeoListings.new base_url+full_path_parts[0...i+1].collect{|p| URI.escape p}.join('/')
        rescue CraigScrape::Scraper::FetchError
          bad_geo_path! full_path
        end

        # This probably tells us the first part of the path was 'correct', but not the rest:
        bad_geo_path! full_path if geo_listing and geo_listing.location == l.location

        geo_listing = l
      end

      # We have a valid listing page we found, and we can just return all the sites on it:
      geo_listing.sites.collect{|n,s| s }
    end

Public Instance methods

Returns the GeoLocation‘s full name

[Source]

# File lib/geo_listings.rb, line 36
    def location
      unless @location
        cursor = html % 'h3 > b > a:first-of-type'
        cursor = cursor.next_node if cursor       
        @location = $1 if cursor and LOCATION_NAME.match he_decode(cursor.to_s)
      end
      
      @location
    end

Returns a hash of site name to urls in the current listing

[Source]

# File lib/geo_listings.rb, line 47
    def sites
      unless @sites
        @sites = {}
        (html / 'div#list > a').each do |el_a|
          site_name = he_decode strip_html(el_a.inner_html)
          @sites[site_name] = $1 if URL_HOST_PART.match el_a[:href]
        end
      end
      
      @sites
    end

[Validate]