Usage
import requests
from lxml import html

from metacrawler.fields import Field

# Download the page and parse it into an lxml element tree.
response = requests.get('https://github.com/pyvim/metacrawler')
page = html.fromstring(response.text)

# Extract the text of the element marked itemprop="about".
field = Field(xpath='//span[@itemprop="about"]/text()')
content = field.crawl(page)
print(content)
# Output:
#
# A lightweight Python crawling framework.
Initialization arguments
- value (optional): Any value. Used as a stub (placeholder) value.
- xpath (optional): str. XPath expression used to extract the value.
- fields (optional): dict. Nested fields. Format: {'field_name': field_instance}.
- to (optional): type. Value representation. May be one of (list, dict, int, float, str).
  list: the values in the result list should not be lxml.Element instances.
  dict: used with nested fields.
  int, float, str: the first found value.
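The `to` rules above can be pictured with a small stand-alone sketch. It does not import metacrawler: `represent` is a hypothetical helper written only for this illustration, and returning `None` when nothing was found is an assumption of the sketch, not documented library behavior.

```python
def represent(found, to):
    """Hypothetical helper: apply the documented `to` rules to a list of matches.

    This is an illustration only, not metacrawler's implementation.
    """
    if to is list:
        # list: keep every found value.
        return list(found)
    if to in (int, float, str):
        # int, float, str: convert only the first found value.
        # Returning None for no matches is an assumption of this sketch.
        return to(found[0]) if found else None
    # dict is built from nested fields rather than from a flat list of matches.
    raise ValueError('unsupported representation: %r' % (to,))


matches = ['42', '7']  # strings such as a text() XPath might return
as_list = represent(matches, to=list)
as_int = represent(matches, to=int)
as_float = represent(matches, to=float)
print(as_list, as_int, as_float)
```

Here only the `list` representation keeps both matches; the scalar types look at the first one.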
Initialization class attributes
- value (optional): Any value. Used as a stub (placeholder) value.
- xpath (optional): str. XPath expression used to extract the value.
- fields (optional): dict. Nested fields. Format: {'field_name': field_instance}.
- to (optional): type. Value representation. May be one of (list, dict, int, float, str).
  list: the values in the list should not be lxml.Element instances.
  dict: used with nested fields.
  int, float, str: the first found value.
- any nested field instance (optional): a metacrawler.fields.Field instance.
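As a rough picture of the class-attribute form, the sketch below uses a stand-in class instead of importing metacrawler. `NestedSketch` and `Repository` are hypothetical names, and the way nested instances are gathered here is an assumption made for illustration; the real `Field` may collect them differently.

```python
class NestedSketch:
    """Hypothetical stand-in for metacrawler.fields.Field (illustration only)."""

    value = None  # class-level default, overridable per instance

    def __init__(self, value=None):
        if value is not None:
            self.value = value

    @property
    def fields(self):
        # Gather nested field instances declared as class attributes,
        # in the documented {'field_name': field_instance} format.
        return {
            name: attr
            for name, attr in vars(type(self)).items()
            if isinstance(attr, NestedSketch)
        }

    def crawl(self, page):
        nested = self.fields
        if nested:
            # With nested fields the result is a dict keyed by field name.
            return {name: field.crawl(page) for name, field in nested.items()}
        # A leaf in this sketch simply returns its fixed value.
        return self.value


class Repository(NestedSketch):
    # Nested field instances declared as class attributes.
    name = NestedSketch(value='metacrawler')
    language = NestedSketch(value='Python')


# page is unused here because every leaf has a fixed value.
result = Repository().crawl(page=None)
print(result)
```

Declaring nested fields on the class keeps the structure of the expected result visible in one place.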
Public attributes
- value (optional): Any value. Used as a stub (placeholder) value.
- xpath (optional): str. XPath expression used to extract the value.
- fields (optional): dict. Nested fields. Format: {'field_name': field_instance}.
- to (optional): type. Value representation. May be one of (list, dict, int, float, str).
  list: the values in the list should not be lxml.Element instances.
  dict: used with nested fields.
  int, float, str: the first found value.
Public methods
- before(self)
  Called before crawling starts. Can be used for any setup actions, such as setting values dynamically.
- clean(self, value)
  Called after crawling finishes. Can be used for any processing of the raw data. Must return the result.
  - value: raw data.
- crawl(self, page)
  Starts crawling.
  - page: lxml.Element. The lxml representation of the website page.
- get_ATTRIBUTE_NAME(self):
  Sets the attribute ATTRIBUTE_NAME (any attribute name) to the returned value at initialization.
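The order in which these hooks run can be sketched with a stand-in class. Nothing below imports metacrawler: `HookSketch`, `Title` and the `events` list are hypothetical, and the way `get_ATTRIBUTE_NAME` methods are resolved is an assumption based only on the descriptions above.

```python
events = []  # records hook calls so the order can be inspected


class HookSketch:
    """Hypothetical stand-in showing the hook order; not metacrawler's code."""

    def __init__(self, value=None):
        self.value = value
        # Resolve get_ATTRIBUTE_NAME methods at initialization:
        # the value returned by get_value() is stored as self.value, and so on.
        for name in dir(self):
            if name.startswith('get_') and callable(getattr(self, name)):
                setattr(self, name[len('get_'):], getattr(self, name)())

    def before(self):
        """Hook called before crawling starts."""

    def clean(self, value):
        """Hook called after crawling finishes; must return the result."""
        return value

    def crawl(self, page):
        self.before()
        raw = self.value  # a real field would evaluate its XPath against page
        return self.clean(raw)


class Title(HookSketch):
    def get_value(self):
        return '  metacrawler  '

    def before(self):
        events.append('before')

    def clean(self, value):
        events.append('clean')
        return value.strip()


result = Title().crawl(page=None)
print(result, events)
```

In this sketch `before` runs first, the raw value is produced next, and `clean` has the last word on what `crawl` returns.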