Usage
import requests
from lxml import html
from metacrawler.fields import Field
response = requests.get('https://github.com/pyvim/metacrawler')
page = html.fromstring(response.text)
field = Field(xpath='//span[@itemprop="about"]/text()')
content = field.crawl(page)
print(content)
# Output:
#
# A lightweight Python crawling framework.
Initialization arguments
- value (optional):
Any value. Used as a placeholder (stub) value.
- xpath (optional): str.
XPath used to extract the value.
- fields (optional): dict.
Nested fields. Format: {'field_name': field_instance}.
- to (optional): type.
Value representation. May be one of (list, dict, int, float, str).
list: values in the result list should not be lxml.Element instances.
dict: used with nested fields.
int, float, str: the first value found.
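The `to` rules above can be sketched in plain Python. This is only an illustration of the documented behaviour, not metacrawler's code: `represent` is a hypothetical helper, and returning None for an empty result is an assumption.

```python
def represent(found, to):
    """Hypothetical helper, not part of metacrawler: apply the `to` rules."""
    if to is list:
        # list: keep every found value.
        return list(found)
    if to in (int, float, str):
        # int, float, str: convert only the first found value.
        # Assumption: an empty result yields None.
        return to(found[0]) if found else None
    raise ValueError('this sketch covers list, int, float and str only')


stars = represent(['42', '7'], int)
topics = represent(['python', 'crawler'], list)
print(stars, topics)
```

The dict case is left out here because it depends on nested fields rather than on a flat list of found values.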
Initialization class attributes
- value (optional):
Any value. Used as a placeholder (stub) value.
- xpath (optional): str.
XPath used to extract the value.
- fields (optional): dict.
Nested fields. Format: {'field_name': field_instance}.
- to (optional): type.
Value representation. May be one of (list, dict, int, float, str).
list: values in the list should not be lxml.Element instances.
dict: used with nested fields.
int, float, str: the first value found.
- any nested field instance (optional):
metacrawler.fields.Field instance.
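The declarative style described above, where settings and nested fields are written as class attributes, might look roughly like the following. `DeclarativeSketch` and its `nested` method are hypothetical stand-ins written for this illustration; metacrawler's real Field collects nested fields in its own way.

```python
class DeclarativeSketch:
    """Stand-in for illustration only; metacrawler's real Field may differ."""

    xpath = None
    to = str

    def __init__(self, **kwargs):
        # Store initialization arguments on the instance.
        for name, val in kwargs.items():
            setattr(self, name, val)

    def nested(self):
        # Collect class attributes that are themselves field instances.
        return {
            name: attr
            for name, attr in vars(type(self)).items()
            if isinstance(attr, DeclarativeSketch)
        }


class Repo(DeclarativeSketch):
    to = dict  # dict: used with nested fields
    about = DeclarativeSketch(xpath='//span[@itemprop="about"]/text()')
    title = DeclarativeSketch(xpath='//title/text()')


repo = Repo()
names = sorted(repo.nested())
print(names)
```

In this sketch the collected mapping has the same {'field_name': field_instance} shape as the `fields` argument.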
Public attributes
- value (optional):
Any value. Used as a placeholder (stub) value.
- xpath (optional): str.
XPath used to extract the value.
- fields (optional): dict.
Nested fields. Format: {'field_name': field_instance}.
- to (optional): type.
Value representation. May be one of (list, dict, int, float, str).
list: values in the list should not be lxml.Element instances.
dict: used with nested fields.
int, float, str: the first value found.
Public methods
- before(self)
Called before crawling starts. Can be used for any setup actions (for example, setting values dynamically).
- clean(self, value)
Called after crawling finishes. Can be used to process the raw data. Must return the result.
  - value: raw data.
- crawl(self, page)
Starts crawling.
  - page: lxml.Element
  lxml representation of the website page.
- get_ATTRIBUTE_NAME(self):
Sets the attribute ATTRIBUTE_NAME (any attribute name) to the returned value at initialization.
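The order in which these hooks run can be sketched with a small stand-in class. `HookSketch` is hypothetical and not metacrawler's implementation; in particular, the page here is a plain dict keyed by XPath rather than an lxml.Element, so that the example runs without any dependencies.

```python
class HookSketch:
    """Stand-in for illustration only; metacrawler's real Field may differ."""

    value = None
    xpath = None

    def __init__(self, **kwargs):
        for name, val in kwargs.items():
            setattr(self, name, val)
        # get_ATTRIBUTE_NAME: the returned value becomes ATTRIBUTE_NAME.
        for name in dir(self):
            if name.startswith('get_'):
                setattr(self, name[len('get_'):], getattr(self, name)())

    def before(self):
        """Hook: runs before crawling starts."""

    def clean(self, value):
        """Hook: receives the raw data and must return the result."""
        return value

    def crawl(self, page):
        self.before()
        # Assumption: `page` is a plain dict keyed by XPath, standing in for lxml.
        raw = self.value if self.value is not None else page.get(self.xpath)
        return self.clean(raw)


class TitleField(HookSketch):
    def get_xpath(self):
        # Supplies the `xpath` attribute at initialization.
        return '//title/text()'

    def clean(self, value):
        # Post-process the raw data and return the result.
        return value.strip()


field = TitleField()
result = field.crawl({'//title/text()': '  metacrawler  '})
print(result)
```

Overriding `clean` keeps post-processing next to the field it belongs to, and `get_xpath` shows how an attribute can be computed instead of hard-coded.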