Crawler

Usage

from metacrawler.fields import Field
from metacrawler.crawlers import Crawler


class GithubRepoCrawler(Crawler):

    repo_name = Field(xpath='//strong[@itemprop="name"]/a/text()')
    author_url = Field(xpath='//a[@rel="author"]/@href')
    description = Field(xpath='//span[@itemprop="about"]/text()')


crawler = GithubRepoCrawler()
crawler.url = 'https://github.com/pyvim/metacrawler'
content = crawler.crawl()
print(content)

# Output:
#
# {'repo_name': 'metacrawler', 'description': ' A lightweight Python crawling framework.', 'author_url': '/pyvim'}

metacrawler.crawlers.Crawler(self)

Class attributes read at initialization

url (optional): str. URL of the page to crawl.
collapse (optional): bool. Collapses the crawler key in the result data, for example {'crawler': {'field': []}} becomes {'crawler': []}. Works only with a single field; raises ValueError if the crawler has more than one field.
session (optional): requests.Session instance. Used to make requests.
pagination (optional): metacrawler.pagination.Pagination instance. Used to navigate between website pages.
authentication (optional): metacrawler.authentication.Authentication instance. Used to authenticate on the website.
limit (optional): int. Limit on the number of results. The result count may exceed the limit because of pagination.
timeout (optional): float. Time to wait before the current connection is closed.
any crawler or field instance (optional): metacrawler.crawlers.Crawler or metacrawler.fields.Field instance.
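
The collapse transformation described above can be illustrated with a small standalone sketch. The collapse_result function below is hypothetical and is not part of metacrawler; it only mimics the documented behavior (one field is collapsed, more than one raises ValueError) on a plain dict.

```python
def collapse_result(result):
    """Turn {'crawler': {'field': value}} into {'crawler': value}.

    Hypothetical helper that mimics the documented collapse behavior;
    it is not metacrawler's own implementation.
    """
    collapsed = {}
    for crawler_name, fields in result.items():
        if len(fields) != 1:
            # The documentation says collapse works only with a single field.
            raise ValueError(
                'collapse needs exactly one field, got %d' % len(fields)
            )
        # Unpack the single field value and drop the field key.
        (value,) = fields.values()
        collapsed[crawler_name] = value
    return collapsed


nested = {'crawler': {'field': ['a', 'b']}}
flat = collapse_result(nested)
print(flat)  # {'crawler': ['a', 'b']}
```

With two fields, for example {'crawler': {'name': [], 'url': []}}, the sketch raises ValueError, which matches the documented restriction.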

Public attributes

url (optional): str. URL of the page to crawl.
collapse (optional): bool. Collapses the crawler key in the result data, for example {'crawler': {'field': []}} becomes {'crawler': []}. Works only with a single field; raises ValueError if the crawler has more than one field.
session (optional): requests.Session instance. Used to make requests.
pagination (optional): metacrawler.pagination.Pagination instance. Used to navigate between website pages.
authentication (optional): metacrawler.authentication.Authentication instance. Used to authenticate on the website.
limit (optional): int. Limit on the number of results. The result count may exceed the limit because of pagination.
timeout (optional): float. Time to wait before the current connection is closed.
data: dict or list. Contains the crawled data once crawling has finished.
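
One plausible reading of why the result count can exceed limit is that the limit is checked only after a whole page has been collected. The sketch below simulates that with plain lists; crawl_with_limit and the sample pages are hypothetical, and the assumption about when the check happens is not confirmed by the library documentation.

```python
def crawl_with_limit(pages, limit):
    """Collect items page by page, checking the limit only between pages.

    Hypothetical simulation; assumes a page is always consumed whole.
    """
    results = []
    for page_items in pages:
        results.extend(page_items)  # take every item found on this page
        if len(results) >= limit:
            break  # limit reached, so no further pages are requested
    return results


# Three simulated pages with three results each.
pages = [['r1', 'r2', 'r3'], ['r4', 'r5', 'r6'], ['r7', 'r8', 'r9']]
results = crawl_with_limit(pages, limit=4)
print(len(results))  # 6
```

Here the limit of 4 is reached partway through the second page, so all six results from the first two pages are kept and the third page is never visited.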

Public methods

before(self): Called before crawling starts. Can be used for any setup actions (for example, setting values dynamically).
clean(self, value): Called after crawling finishes. Can be used for any processing of the raw data. Must return the result.
    value: raw data.
crawl(self, *args, **kwargs): Starts crawling.
paginate(self, page): Returns the next URL to crawl if the crawler has a pagination attribute.
    page: lxml.Element. lxml representation of the website page.
get_ATTRIBUTE_NAME(self): Sets the attribute ATTRIBUTE_NAME (any attribute name) to the returned value at initialization.
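
The hooks above (get_ATTRIBUTE_NAME, before, clean) can be pictured with a minimal stand-in. SketchCrawler and its _fetch method are hypothetical and do not reproduce metacrawler's code; they only show the call order the descriptions suggest, assuming get_* methods are resolved in the constructor and clean receives the raw result.

```python
class SketchCrawler:
    """Stand-in base class for illustration; NOT metacrawler's implementation."""

    url = None

    def __init__(self):
        # Resolve every get_<name> method and store its return value
        # as the attribute <name>, as get_ATTRIBUTE_NAME is described.
        for name in dir(self):
            if name.startswith('get_') and callable(getattr(self, name)):
                setattr(self, name[len('get_'):], getattr(self, name)())

    def before(self):
        """Hook called before crawling starts."""

    def clean(self, value):
        """Hook called with the raw data; must return the result."""
        return value

    def _fetch(self):
        # Hypothetical placeholder for the real request and parsing step.
        return {'description': ' A lightweight Python crawling framework.'}

    def crawl(self):
        self.before()
        raw = self._fetch()
        return self.clean(raw)


class RepoCrawler(SketchCrawler):

    def get_url(self):
        # The returned value becomes self.url at initialization.
        return 'https://github.com/pyvim/metacrawler'

    def before(self):
        self.started = True  # any setup action can go here

    def clean(self, value):
        # Strip stray whitespace from every raw text value.
        return {key: text.strip() for key, text in value.items()}


crawler = RepoCrawler()
content = crawler.crawl()
print(crawler.url)  # https://github.com/pyvim/metacrawler
print(content)      # {'description': 'A lightweight Python crawling framework.'}
```

Keeping the cleanup in clean rather than in the XPath expressions leaves the field definitions short and puts all post-processing in one place.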