| Class | CraigScrape::Posting |
| In: | lib/posting.rb |
| Parent: | CraigScrape::Scraper |
Posting represents a fully downloaded and parsed Craigslist post. This class is generally returned by the listing scrape methods and contains the post summaries for a specific search URL or a general listing category.
| POST_DATE | = | /Date:[^\d]*((?:[\d]{2}|[\d]{4})\-[\d]{1,2}\-[\d]{1,2}[^\d]+[\d]{1,2}\:[\d]{1,2}[ ]*[AP]M[^a-z]+[a-z]+)/i |
| LOCATION | = | /Location\:[ ]+(.+)/ |
| HEADER_LOCATION | = | /^.+[ ]*\-[ ]*[\$]?[\d]+[ ]*\((.+)\)$/ |
| POSTING_ID | = | /PostingID\:[ ]+([\d]+)/ |
| REPLY_TO | = | /(.+)/ |
| PRICE | = | /((?:^\$[\d]+(?:\.[\d]{2})?)|(?:\$[\d]+(?:\.[\d]{2})?$))/ |
| USERBODY_PARTS | = | /^(.+)\<div id\=\"userbody\">(.+)\<br[ ]*[\/]?\>\<br[ ]*[\/]?\>(.+)\<\/div\>(.+)$/m |
| HTML_HEADER | = | /^(.+)\<div id\=\"userbody\">/m |
| IMAGE_SRC | = | /\<im[a]?g[e]?[^\>]*src=(?:\'([^\']+)\'|\"([^\"]+)\"|([^ ]+))[^\>]*\>/ |
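As a rough illustration of how the simpler patterns in this table behave, the sketch below copies LOCATION, HEADER_LOCATION and POSTING_ID verbatim and applies them to invented sample strings. The helper `first_capture` and the sample text are assumptions made for this example; they are not part of CraigScrape and are not real Craigslist markup.

```ruby
# Patterns copied from the constants table; the sample strings are invented.
LOCATION        = /Location\:[ ]+(.+)/
HEADER_LOCATION = /^.+[ ]*\-[ ]*[\$]?[\d]+[ ]*\((.+)\)$/
POSTING_ID      = /PostingID\:[ ]+([\d]+)/

# Hypothetical helper: first capture group of +pattern+ in +text+, or nil.
def first_capture(pattern, text)
  match = pattern.match(text)
  match && match[1]
end

puts first_capture(LOCATION, 'Location: Brooklyn')
puts first_capture(HEADER_LOCATION, 'Mountain bike - $250 (Brooklyn)')
puts first_capture(POSTING_ID, 'PostingID: 1234567890').to_i
```

HEADER_LOCATION only matches headers that end in a parenthesised location preceded by a dash and a number, which is why the class treats it as a last-resort fallback.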
Creates a new Post from a URL (String) or from supplied parameters (Hash).
    # File lib/posting.rb, line 29
    def initialize(*args)
      super(*args)

      # Validate that required fields are present, at least - if we've downloaded it from a url
      parse_error! if (
        args.first.kind_of? String and
        !flagged_for_removal? and
        !posting_has_expired? and
        !deleted_by_author? and [
          contents,posting_id,post_time,header,title,full_section
        ].any?{|f| f.nil? or (f.respond_to? :length and f.length == 0)}
      )
    end
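The validation in the constructor treats a field as missing when it is nil or when it responds to `length` and is empty. A minimal sketch of that check in isolation follows; the method names `blank_field?` and `missing_required?` are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helper mirroring the block passed to any? in initialize.
def blank_field?(field)
  field.nil? || (field.respond_to?(:length) && field.length == 0)
end

# True when at least one of the given fields is nil or empty.
def missing_required?(fields)
  fields.any? { |f| blank_field?(f) }
end

complete   = ['contents', 12345, Time.now, 'header', 'title', ['for sale']]
incomplete = ['contents', 12345, Time.now, 'header', 'title', []]

puts missing_required?(complete)
puts missing_required?(incomplete)
```

Values without a `length` method, such as the Integer posting id or the Time timestamp, pass the check as long as they are not nil.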
Returns true if this Post was parsed, and represents a ‘This posting has been deleted by its author.’ notice
    # File lib/posting.rb, line 188
    def deleted_by_author?
      @deleted_by_author = (
        system_post? and header_as_plain == "This posting has been deleted by its author."
      ) if @deleted_by_author.nil?

      @deleted_by_author
    end
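The `if @deleted_by_author.nil?` guard matters because the cached answer can legitimately be `false`; a plain `||=` would repeat the comparison on every call in that case. The sketch below shows the same pattern with a hypothetical `Notice` class and a counter added purely to make the caching visible.

```ruby
# Hypothetical class demonstrating nil-guarded memoization of a boolean.
class Notice
  attr_reader :checks

  def initialize(header)
    @header  = header
    @checks  = 0
    @deleted = nil
  end

  def deleted?
    if @deleted.nil?
      @checks += 1 # counts how often the comparison actually runs
      @deleted = (@header == 'This posting has been deleted by its author.')
    end
    @deleted
  end
end

notice = Notice.new('Mountain bike - $250 (Brooklyn)')
3.times { notice.deleted? }
puts notice.checks
```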
Returns true if this Post was parsed, and is merely a ‘Flagged for Removal’ page
    # File lib/posting.rb, line 179
    def flagged_for_removal?
      @flagged_for_removal = (
        system_post? and header_as_plain == "This posting has been flagged for removal"
      ) if @flagged_for_removal.nil?

      @flagged_for_removal
    end
Array, hierarchical representation of the post's section
    # File lib/posting.rb, line 66
    def full_section
      unless @full_section
        @full_section = []

        (html_head/"div[@class='bchead']//a").each do |a|
          @full_section << he_decode(a.inner_html) unless a['id'] and a['id'] == 'ef'
        end if html_head
      end

      @full_section
    end
true if the post summary has ‘img(s)’. ‘imgs’ are different than pics, in that the resource is not hosted on craigslist's server. This can always be pulled from the listing post-summary, and should never cause an additional page load
    # File lib/posting.rb, line 254
    def has_img?
      img_types.include? :img
    end
true if the post summary has ‘pic(s)’. ‘pics’ are different than imgs, in that craigslist is hosting the resource on its own servers. This can always be pulled from the listing post-summary, and should never cause an additional page load
    # File lib/posting.rb, line 260
    def has_pic?
      img_types.include? :pic
    end
Array, urls of the post's images that are not hosted on craigslist
    # File lib/posting.rb, line 151
    def images
      # Keep in mind that when users post html to craigslist, they're often not posting wonderful html...
      @images = (
        contents ?
          contents.scan(IMAGE_SRC).collect{ |a| a.find{|b| !b.nil? } } :
          []
      ) unless @images

      @images
    end
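IMAGE_SRC has three alternative capture groups (single-quoted, double-quoted and unquoted `src` values), so `scan` returns one three-element array per tag with `nil` in the unused slots; the `find` call picks out the group that matched. The sketch below reproduces that step on an invented HTML fragment. The method name `image_urls` and the example.com URLs are assumptions for illustration.

```ruby
# Pattern copied from the constants table.
IMAGE_SRC = /\<im[a]?g[e]?[^\>]*src=(?:\'([^\']+)\'|\"([^\"]+)\"|([^ ]+))[^\>]*\>/

# Hypothetical stand-alone version of the scan/collect step in images.
def image_urls(contents)
  return [] unless contents
  contents.scan(IMAGE_SRC).collect { |groups| groups.find { |g| !g.nil? } }
end

sample = %q{<p>Bike</p><img src="http://example.com/a.jpg"> } +
         %q{<img alt='x' src='http://example.com/b.png' />}

puts image_urls(sample).inspect
```

Because the pattern has no case-insensitive flag, tags written as `<IMG ...>` are not picked up by it.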
Array, which image types are listed for the post. This is always able to be pulled from the listing post-summary, and should never cause an additional page load
    # File lib/posting.rb, line 231
    def img_types
      unless @img_types
        @img_types = []

        @img_types << :img if images.length > 0
        @img_types << :pic if pics.length > 0
      end

      @img_types
    end
Returns the post label. The label would appear at first glance to be identical to the header, but it's not. The label is cited on the listings pages, and generally includes everything in the header with the exception of the location. Sometimes additional information is included in the header (e.g. ‘(map)’ on rea listings) that isn't listed in the label. This is also used as a bandwidth shortcut for the craigwatch program, and is a guaranteed identifier for the post that won't result in a full page load from the post's url.
    # File lib/posting.rb, line 219
    def label
      unless @label or system_post?
        @label = header

        @label = $1 if location and /(.+?)[ ]*\(#{location}\).*?$/.match @label
      end

      @label
    end
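To make the header-to-label trimming concrete, here is a small sketch that applies the same pattern to invented headers. `label_from` is a hypothetical stand-alone function, not part of the class, and the sample headers are assumptions.

```ruby
# Hypothetical stand-alone version of the trimming done in label.
def label_from(header, location)
  return header unless location
  match = /(.+?)[ ]*\(#{location}\).*?$/.match(header)
  match ? match[1] : header
end

puts label_from('2br apartment - $1200 (Downtown) (map)', 'Downtown')
puts label_from('Mountain bike - $250', nil)
```

As in the source, the location is interpolated into the pattern unescaped, so a location containing regex metacharacters would be read as part of the pattern; wrapping it in `Regexp.escape` would be one way to guard against that.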
String, the location of the item, as best could be parsed
    # File lib/posting.rb, line 122
    def location
      if @location.nil? and craigslist_body and html
        # Location (when explicitly defined):
        cursor = craigslist_body.at 'ul' unless @location

        # Apa section includes other things in the li's (cats/dogs ok fields)
        cursor.children.each do |li|
          if LOCATION.match li.inner_html
            @location = he_decode($1) and break
            break
          end
        end if cursor

        # Real estate listings can work a little different for location:
        unless @location
          cursor = craigslist_body.at 'small'
          cursor = cursor.previous_node until cursor.nil? or cursor.text?

          @location = he_decode(cursor.to_s.strip) if cursor
        end

        # So, *sometimes* the location just ends up being in the header, I don't know why:
        @location = $1 if @location.nil? and HEADER_LOCATION.match header
      end

      @location
    end
Array, urls of the post's craigslist-hosted images
    # File lib/posting.rb, line 163
    def pics
      unless @pics
        @pics = []

        if html and craigslist_body
          # Now let's find the craigslist hosted images:
          img_table = (craigslist_body / 'table').find{|e| e.name == 'table' and e[:summary] == 'craigslist hosted images'}

          @pics = (img_table / 'img').collect{|i| i[:src]} if img_table
        end
      end

      @pics
    end
Reflects only the date portion of the posting. Does not include hours/minutes. This is useful when reflecting the listing scrapes, and can be safely used if you wish to conserve bandwidth by not pulling an entire post from a listing scrape.
    # File lib/posting.rb, line 208
    def post_date
      @post_date = Time.local(*[0]*3+post_time.to_a[3...10]) unless @post_date or post_time.nil?

      @post_date
    end
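The expression in `post_date` relies on `Time#to_a`, which returns `[sec, min, hour, day, month, year, wday, yday, isdst, zone]`. Replacing the first three elements with zeros and passing the result back to the ten-argument form of `Time.local` yields midnight on the same day. A minimal sketch follows, with `date_only` as a hypothetical helper name and an arbitrary sample timestamp.

```ruby
# Hypothetical helper reproducing the expression used in post_date.
def date_only(time)
  # Zero out sec, min and hour; keep day, month, year and the remaining fields.
  Time.local(*([0, 0, 0] + time.to_a[3...10]))
end

stamp = Time.local(2009, 7, 15, 14, 30, 45)
day   = date_only(stamp)

puts day.strftime('%Y-%m-%d %H:%M:%S')
```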
Time, reflects the full timestamp of the posting
    # File lib/posting.rb, line 90
    def post_time
      unless @post_time
        cursor = html_head.at 'hr' if html_head
        cursor = cursor.next_node until cursor.nil? or POST_DATE.match cursor.to_s
        @post_time = Time.parse $1 if $1
      end

      @post_time
    end
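The sketch below shows only the capture step: POST_DATE is copied from the constants table and applied to an invented date line, and the captured string is what `post_time` then hands to `Time.parse`. The helper name `raw_post_date` and the sample text are assumptions for this illustration.

```ruby
# Pattern copied from the constants table.
POST_DATE = /Date:[^\d]*((?:[\d]{2}|[\d]{4})\-[\d]{1,2}\-[\d]{1,2}[^\d]+[\d]{1,2}\:[\d]{1,2}[ ]*[AP]M[^a-z]+[a-z]+)/i

# Hypothetical helper: the raw timestamp text captured by POST_DATE, or nil.
def raw_post_date(text)
  match = POST_DATE.match(text)
  match && match[1]
end

puts raw_post_date('Date: 2009-07-15, 2:30PM EST')
```

The capture includes the trailing time zone abbreviation, so the zone information is available to whatever parses the string afterwards.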
Returns true if this Post was parsed, and represents a ‘This posting has expired.’ notice
    # File lib/posting.rb, line 197
    def posting_has_expired?
      @posting_has_expired = (
        system_post? and header_as_plain == "This posting has expired."
      ) if @posting_has_expired.nil?

      @posting_has_expired
    end
Integer, Craigslist's unique posting id
    # File lib/posting.rb, line 101
    def posting_id
      unless @posting_id
        cursor = Hpricot.parse html_footer if html_footer
        cursor = cursor.next_node until cursor.nil? or POSTING_ID.match cursor.to_s
        @posting_id = $1.to_i if $1
      end

      @posting_id
    end
Returns the best guess of a price, judging by the label's contents. Price is available when pulled from the listing summary and can be safely used if you wish to conserve bandwidth by not pulling an entire post from a listing scrape.
    # File lib/posting.rb, line 272
    def price
      $1.tr('$','').to_f if label and PRICE.match label
    end
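PRICE only recognises a dollar amount at the very start or the very end of the label, which is why the result is described as a best guess. The sketch below applies the pattern to a few invented labels; `price_from` is a hypothetical stand-alone function that follows the same steps as `price`.

```ruby
# Pattern copied from the constants table.
PRICE = /((?:^\$[\d]+(?:\.[\d]{2})?)|(?:\$[\d]+(?:\.[\d]{2})?$))/

# Hypothetical stand-alone version of price: a Float, or nil when nothing matches.
def price_from(label)
  return nil unless label
  match = PRICE.match(label)
  match[1].tr('$', '').to_f if match
end

['$1200 2br apartment', 'Mountain bike - $250', '$99.50 sofa', 'Bike $250 obo'].each do |label|
  puts "#{label.inspect} => #{price_from(label).inspect}"
end
```

A price that sits in the middle of the label, as in the last sample, is not detected.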
String, represents the post's reply-to address, if listed
    # File lib/posting.rb, line 79
    def reply_to
      unless @reply_to
        cursor = html_head.at 'hr' if html_head
        cursor = cursor.next_sibling until cursor.nil? or cursor.name == 'a'
        @reply_to = $1 if cursor and REPLY_TO.match he_decode(cursor.inner_html)
      end

      @reply_to
    end
Retrieves the most-relevant craigslist ‘section’ of the post. This is generally the same as full_section.last. However, this (sometimes/rarely) conserves bandwidth by pulling this field from the listing post-summary
    # File lib/posting.rb, line 244
    def section
      unless @section
        @section = full_section.last if full_section
      end

      @section
    end