Usage
from metacrawler.fields import Field
from metacrawler.crawlers import Crawler


class GithubRepoCrawler(Crawler):
    repo_name = Field(xpath='//strong[@itemprop="name"]/a/text()')
    author_url = Field(xpath='//a[@rel="author"]/@href')
    description = Field(xpath='//span[@itemprop="about"]/text()')


crawler = GithubRepoCrawler()
crawler.url = 'https://github.com/pyvim/metacrawler'
content = crawler.crawl()
print(content)
# Output:
#
# {'repo_name': 'metacrawler', 'description': ' A lightweight Python crawling framework.', 'author_url': '/pyvim'}
metacrawler.crawlers.Crawler(self)
Initialization class attributes
- url (optional): str. URL of the page to crawl.
- collapse (optional): bool. Collapse the crawler key in the result data, for example {'crawler': {'field': []}} becomes {'crawler': []}. Works only with a single field; if the crawler has more than one field, a ValueError exception is raised.
- session (optional): requests.Session instance. Used to make requests.
- pagination (optional): metacrawler.pagination.Pagination instance. Used to navigate between pages of the website.
- authentication (optional): metacrawler.authentication.Authentication instance. Used to authenticate on the website.
- limit (optional): int. Limit on the number of results. The number of results may exceed the limit because of pagination.
- timeout (optional): float. Timeout after which the current connection is closed.
- any crawler or field instance (optional): metacrawler.crawlers.Crawler or metacrawler.fields.Field instance.
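The collapse option is easiest to see on plain dictionaries. The sketch below only mimics the behavior described above; collapse_result is a hypothetical helper written for illustration and is not part of metacrawler.

```python
def collapse_result(name, data):
    """Mimic the documented `collapse` option for one crawler's result.

    `name` is the crawler key and `data` maps field names to values.
    Both the function and its signature are illustrative assumptions.
    """
    if len(data) != 1:
        # The documentation says collapse works only with a single field.
        raise ValueError('collapse requires exactly one field')
    # Replace the inner {'field': value} dict with the bare value.
    (value,) = data.values()
    return {name: value}


nested = {'crawler': {'field': ['a', 'b']}}
print(collapse_result('crawler', nested['crawler']))
# → {'crawler': ['a', 'b']}
```

Passing a dictionary with two or more fields raises ValueError, matching the restriction stated for collapse.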
Public attributes
- url (optional): str. URL of the page to crawl.
- collapse (optional): bool. Collapse the crawler key in the result data, for example {'crawler': {'field': []}} becomes {'crawler': []}. Works only with a single field; if the crawler has more than one field, a ValueError exception is raised.
- session (optional): requests.Session instance. Used to make requests.
- pagination (optional): metacrawler.pagination.Pagination instance. Used to navigate between pages of the website.
- authentication (optional): metacrawler.authentication.Authentication instance. Used to authenticate on the website.
- limit (optional): int. Limit on the number of results. The number of results may exceed the limit because of pagination.
- timeout (optional): float. Timeout after which the current connection is closed.
- data: dict or list. Contains the data after crawling.
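One plausible reading of why the result count can exceed limit is that the limit is checked only between pages, so the last page fetched is kept whole. The sketch below illustrates that reading with plain lists; crawl_with_limit and the page data are hypothetical and do not reflect metacrawler's actual implementation.

```python
def crawl_with_limit(pages, limit):
    """Collect items page by page, checking the limit only between pages.

    `pages` is any iterable of lists, standing in for paginated results.
    """
    results = []
    for items in pages:
        results.extend(items)  # a whole page is always kept
        if len(results) >= limit:
            break  # stop requesting further pages
    return results


pages = [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
print(len(crawl_with_limit(pages, limit=4)))
# → 6
```

With three items per page and a limit of 4, the second page pushes the total to 6 before crawling stops.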
Public methods
- before(self): Called before crawling starts. May be used for any setup actions (such as setting values dynamically).
- clean(self, value): Called after crawling finishes. May be used for any processing of the raw data. Must return the result.
  - value: raw data.
- crawl(self, *args, **kwargs): Starts crawling.
- paginate(self, page): Returns the next URL to crawl if the crawler has a pagination attribute.
  - page: lxml.Element. lxml representation of the website page.
- get_ATTRIBUTE_NAME(self): Sets the attribute ATTRIBUTE_NAME (any attribute name) to the returned value at initialization.
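The order in which these hooks run can be sketched with a small stand-in class. MiniCrawler below is a hypothetical stub written for illustration, not metacrawler's Crawler; its fetch method replaces the real request and parsing step, and the hook order is an assumption based on the descriptions above.

```python
class MiniCrawler:
    """Hypothetical stand-in that imitates the documented hooks."""

    url = None

    def __init__(self):
        # Documented behavior: get_ATTRIBUTE_NAME() sets ATTRIBUTE_NAME
        # to its return value at initialization.
        for name in dir(self):
            if name.startswith('get_') and callable(getattr(self, name)):
                setattr(self, name[len('get_'):], getattr(self, name)())
        self.data = None

    def before(self):
        """Hook called before crawling starts."""

    def clean(self, value):
        """Hook called after crawling; must return the result."""
        return value

    def fetch(self):
        """Placeholder for the network request and parsing (assumption)."""
        return {'title': '  Example  ', 'source': self.url}

    def crawl(self):
        self.before()
        self.data = self.clean(self.fetch())
        return self.data


class TitleCrawler(MiniCrawler):
    def get_url(self):
        # Sets self.url when the crawler is created.
        return 'https://example.com/page'

    def before(self):
        self.started = True

    def clean(self, value):
        # Post-process the raw data and return it.
        value['title'] = value['title'].strip()
        return value


print(TitleCrawler().crawl())
# → {'title': 'Example', 'source': 'https://example.com/page'}
```

With the real library, the same three methods would be overridden on a metacrawler.crawlers.Crawler subclass, while fetching and parsing are handled by the library itself.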