Package org.apache.nutch.any23
Class Any23ParseFilter
- java.lang.Object
  - org.apache.nutch.any23.Any23ParseFilter
- All Implemented Interfaces:
Configurable, HtmlParseFilter, Pluggable
public class Any23ParseFilter extends Object implements HtmlParseFilter
This implementation of HtmlParseFilter uses the Apache Any23 library to parse and extract structured data in RDF format from a variety of Web documents. The supported formats can be found at Apache Any23.
In this implementation, triples are written as Notation3 (N3), and individual triples are identified within the output triple stream by the presence of '\n'. This newline delimiter is a characteristic specific to N3 serialization in Any23. To use other writers that implement the TripleHandler interface, we will most likely need to identify an alternative data characteristic on which to split triple streams.
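The newline-based splitting described above can be illustrated with a minimal sketch. It does not use the Any23 or Nutch APIs: the class name `TripleSplitSketch`, the method `splitTriples`, and the sample triples are hypothetical, and the sketch assumes the writer emits exactly one statement per line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or Any23.
class TripleSplitSketch {

    /**
     * Splits a serialized triple stream on '\n' and returns one entry
     * per non-blank line. Assumes one statement per line.
     */
    static List<String> splitTriples(String serialized) {
        List<String> triples = new ArrayList<>();
        for (String line : serialized.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                triples.add(trimmed);
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        // Sample data invented for illustration only.
        String output =
            "<http://example.org/a> <http://purl.org/dc/terms/title> \"A\" .\n"
          + "<http://example.org/a> <http://purl.org/dc/terms/creator> \"B\" .\n";
        for (String triple : splitTriples(output)) {
            System.out.println(triple);
        }
    }
}
```

A writer whose serialization spreads one statement over several lines would break the one-statement-per-line assumption, which is why a different delimiter would be needed for other TripleHandler implementations.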
-
-
Field Summary
Fields

| Modifier and Type | Field | Description |
|---|---|---|
| static String | ANY_23_CONTENT_TYPES_CONF | |
| static String | ANY_23_EXTRACTORS_CONF | |
| static String | ANY23_TRIPLES | Constant identifier used as a Key for writing and reading triples to and from the metadata Map field. |
Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter
X_POINT_ID
-
-
Constructor Summary
Constructors

| Constructor | Description |
|---|---|
| Any23ParseFilter() | |
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page. |
| Configuration | getConf() | |
| void | setConf(Configuration conf) | |
-
-
-
Field Detail
-
ANY23_TRIPLES
public static final String ANY23_TRIPLES
Constant identifier used as a Key for writing and reading triples to and from the metadata Map field.
- See Also:
- Constant Field Values
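As a rough illustration of how a single key can be used to write and read several triples, the sketch below uses a plain Java map as a stand-in for the metadata field. Everything here is an assumption made for illustration: the class `TripleMetadataSketch`, its methods, and the key string are hypothetical, and the sketch does not use the Nutch metadata classes or the real value of ANY23_TRIPLES.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not part of Nutch.
class TripleMetadataSketch {

    // Placeholder only: in real code, use Any23ParseFilter.ANY23_TRIPLES.
    static final String TRIPLES_KEY = "hypothetical-any23-triples-key";

    // Multi-valued map standing in for the parse metadata.
    private final Map<String, List<String>> metadata = new HashMap<>();

    /** Stores one triple under the shared key. */
    void addTriple(String triple) {
        metadata.computeIfAbsent(TRIPLES_KEY, k -> new ArrayList<>()).add(triple);
    }

    /** Returns all triples stored under the shared key, or an empty list. */
    List<String> getTriples() {
        return metadata.getOrDefault(TRIPLES_KEY, Collections.emptyList());
    }

    public static void main(String[] args) {
        TripleMetadataSketch sketch = new TripleMetadataSketch();
        sketch.addTriple("<http://example.org/a> <http://purl.org/dc/terms/title> \"A\" .");
        sketch.addTriple("<http://example.org/a> <http://purl.org/dc/terms/creator> \"B\" .");
        for (String triple : sketch.getTriples()) {
            System.out.println(triple);
        }
    }
}
```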
-
ANY_23_EXTRACTORS_CONF
public static final String ANY_23_EXTRACTORS_CONF
- See Also:
- Constant Field Values
-
ANY_23_CONTENT_TYPES_CONF
public static final String ANY_23_CONTENT_TYPES_CONF
- See Also:
- Constant Field Values
-
-
Method Detail
-
getConf
public Configuration getConf()
- Specified by:
getConf in interface Configurable
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
-
filter
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Description copied from interface: HtmlParseFilter
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
- Specified by:
filter in interface HtmlParseFilter
- Parameters:
content - the Content for a given response
parseResult - the result of running one or more Parsers on the content
metaTags - a populated HTMLMetaTags object
doc - a DocumentFragment (DOM) which can be processed in the filtering process
- Returns:
- a filtered ParseResult
- See Also:
HtmlParseFilter.filter(Content, ParseResult, HTMLMetaTags, DocumentFragment)
-
-