Field

Usage

import requests
from lxml import html

from metacrawler.fields import Field


response = requests.get('https://github.com/pyvim/metacrawler')
page = html.fromstring(response.text)
field = Field(xpath='//span[@itemprop="about"]/text()')
content = field.crawl(page)
print(content)

# Output:
#
# A lightweight Python crawling framework.

metacralwer.fields.Field(self, value=None, xpath=None, fields=None, to=str)

Initialization arguments

value (optional): Any value. Using as cap.
xpath (optional): str. XPath for extact value.
fields (optional): dict. Nested fields. Format: {'field_name': field_instance}.
to (optional): type. Value representation. May be on of (list, dict, int, float, str).
list: values in result list should not be a lxml.Element.
dict: using with nested fields.
int, float, str: first found value.

Initialization class attributes

value (optional): Any value. Using as cap.
xpath (optional): str. XPath for extact value.
fields (optional): dict. Nested fields. Format: {'field_name': field_instance}.
to (optional): type. Value representation. May be on of (list, dict, int, float, str).
list: values in list should not be a lxml.Element.
dict: using with nested fields.
int, float, str: first found value.
any nested field instance (optional): metacrawler.fields.Field instance.

Public attributes

value (optional): Any value. Using as cap.
xpath (optional): str. XPath for extact value.
fields (optional): dict. Nested fields. Format: {'field_name': field_instance}.
to (optional): type. Value representation. May be on of (list, dict, int, float, str).
list: values in list should not be a lxml.Element.
dict: using with nested fields.
int, float, str: first found value.

Public methods

before(self) Call before start crawling. May use for any actions (dynamic set values and other).
clean(self, value) Call after finish crawling. May use for any actions with raw data. Must return result.
value: raw data.
crawl(self, page) Start crawling.
page: lxml.Element lxml reperesentation of website page.
get_ATTRIBUTE_NAME(self): Set attribute ATTRIBUTE_NAME (any attribute name) by returned value at initialization.