public class HtmlCleanerContentRewriter extends Object implements ContentRewriter
Constructor | Description |
---|---|
HtmlCleanerContentRewriter() | Zero-argument default constructor. |
Modifier and Type | Method | Description |
---|---|---|
void | addTagNodeVisitor(TagNodeVisitor tagNodeVisitor) | Adds a custom TagNodeVisitor that can perform custom processing on the selected tag nodes. |
protected HtmlCleaner | createHtmlCleaner() | Creates an HtmlCleaner instance. |
CleanerTransformations | getCleanerTransformations() | Returns the CleanerTransformations of the underlying HtmlCleaner properties. |
String[] | getCleanerTransformationStringArray() | Returns an array of parsing transformation strings. |
protected HtmlCleaner | getHtmlCleaner() | Returns the underlying HtmlCleaner instance. |
SerializerFactory | getSerializerFactory() | Returns the SerializerFactory instance. |
String | getSinkEncoding() | Returns the character encoding used to write to the sink. |
List<TagNodeVisitor> | getTagNodeVisitors() | Returns the custom TagNodeVisitors that can perform custom processing on the selected tag nodes. |
String | getXpathExpression() | Returns the XPath expression used to select only the filtered tag node(s). |
boolean | isInnerHtmlOnly() | Returns whether the output should include only the inner HTML of the selected tag node(s). |
void | removeAllTagNodeVisitors() | Removes all custom TagNodeVisitors. |
void | removeTagNodeVisitor(TagNodeVisitor tagNodeVisitor) | Removes the specified custom TagNodeVisitor. |
void | rewrite(Source source, Sink sink, ContentRewritingContext context) | Reads content from the source, transforms the content and writes it to the sink. |
void | setCleanerTransformations(CleanerTransformations cleanerTransformations) | Sets the CleanerTransformations of the underlying HtmlCleaner properties. |
void | setCleanerTransformationStringArray(String[] transInfos) | Sets an array of parsing transformation strings. |
void | setInnerHtmlOnly(boolean innerHtmlOnly) | Sets whether the output should include only the inner HTML of the selected tag node(s). |
void | setSerializerFactory(SerializerFactory serializerFactory) | Sets the SerializerFactory property. |
void | setSinkEncoding(String sinkEncoding) | Sets the character encoding used to write to the sink. |
void | setTagNodeVisitors(List<TagNodeVisitor> tagNodeVisitors) | Sets the custom TagNodeVisitors that can perform custom processing on the selected tag nodes. |
void | setXpathExpression(String xpathExpression) | Sets the XPath expression used to select only the filtered tag node(s). |
public HtmlCleanerContentRewriter()
public SerializerFactory getSerializerFactory()
Returns the SerializerFactory instance.
If no SerializerFactory was set beforehand, it creates and returns a new DefaultSerializerFactory that uses SimpleHtmlSerializer by default.
public void setSerializerFactory(SerializerFactory serializerFactory)
Sets the SerializerFactory property.
Parameters: serializerFactory - the SerializerFactory to set
public String getSinkEncoding()
Returns the character encoding used to write to the sink.
The default return value is 'UTF-8' if none was set.
public void setSinkEncoding(String sinkEncoding)
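Sets the character encoding used to write to the sink.
Parameters: sinkEncoding - the character encoding to use
public String getXpathExpression()
Returns the XPath expression used to select only the filtered tag node(s).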
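public void setXpathExpression(String xpathExpression)
Sets the XPath expression used to select only the filtered tag node(s).
Parameters: xpathExpression - the XPath expression to use
public boolean isInnerHtmlOnly()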
Returns whether only the inner HTML of the selected tag node(s) is written to the sink.
public void setInnerHtmlOnly(boolean innerHtmlOnly)
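Sets whether only the inner HTML of the selected tag node(s) is written to the sink.
Parameters: innerHtmlOnly - true to write only the inner HTML
public CleanerTransformations getCleanerTransformations()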
Returns the CleanerTransformations of the underlying HtmlCleaner properties.
public void setCleanerTransformations(CleanerTransformations cleanerTransformations)
Sets the CleanerTransformations of the underlying HtmlCleaner properties.
Parameters: cleanerTransformations - the CleanerTransformations to set
public String[] getCleanerTransformationStringArray()
Returns an array of parsing transformation strings.
HtmlCleaner 2.1 introduced a way to quickly skip specified tags and/or attributes, or to transform them into other tags/attributes, during the parsing process. This avoids expensive document object model manipulation after cleaning.
Here are example transformation rules applied in the cleaning process:
Example rule | Explanation |
---|---|
cfoutput | The cfoutput tag is ignored by the parser, but the content inside it is kept. |
c:block->div,false | The c:block tag is transformed to a div tag and all original attributes are dropped (false in the transformation description). |
font->span,true | The font tag is transformed to span and the original attributes are preserved. |
font.size | The font tag is still transformed to span and its original attributes are preserved thanks to the preceding font->span,true rule, except for the specified size attribute, which is removed. |
font.face | The font tag is still transformed to span and its original attributes are preserved thanks to the preceding font->span,true rule, except for the specified face attribute, which is removed. |
font.style=${style};font-family=${face};font-size=${size}; | The font tag is still transformed to span and its original attributes are preserved thanks to the preceding font->span,true rule. The style attribute has a more complex transformation rule: it is set to the value given by the template ${style};font-family=${face};font-size=${size};. The style attribute of the original font tag therefore comes first in the style attribute of the new span tag, followed by the face attribute as a font-family property and the size attribute as a font-size property. The template is evaluated against the source tag's attributes (names between ${ and }). |
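The four rule shapes in the table can be told apart by their syntax alone. The following is a minimal sketch of that distinction in plain Java. The class name, the enum and the classify method are hypothetical names introduced for illustration, and this is not how HtmlCleaner itself parses the rules.

```java
import java.util.Arrays;
import java.util.List;

class TransformationRuleSketch {

    // Hypothetical labels for the four rule shapes shown in the table.
    enum Kind { SKIP_TAG, TRANSFORM_TAG, REMOVE_ATTRIBUTE, TRANSFORM_ATTRIBUTE }

    // Classifies a rule string purely by its syntax (an assumption of this sketch).
    static Kind classify(String rule) {
        int arrow = rule.indexOf("->");
        int eq = rule.indexOf('=');
        if (arrow >= 0 && (eq < 0 || arrow < eq)) {
            return Kind.TRANSFORM_TAG;        // e.g. font->span,true
        }
        if (eq >= 0) {
            return Kind.TRANSFORM_ATTRIBUTE;  // e.g. font.style=<template>
        }
        if (rule.indexOf('.') >= 0) {
            return Kind.REMOVE_ATTRIBUTE;     // e.g. font.size
        }
        return Kind.SKIP_TAG;                 // e.g. cfoutput
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList(
                "cfoutput",
                "c:block->div,false",
                "font->span,true",
                "font.size",
                "font.face",
                "font.style=${style};font-family=${face};font-size=${size};");
        for (String rule : rules) {
            System.out.println(classify(rule) + "  <-  " + rule);
        }
    }
}
```

The same strings, in the same order, are the kind of values one would pass to setCleanerTransformationStringArray(String[]).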
Suppose you have example HTML markup like the following:
    ...My content 1...
    <cfoutput> Yin and yang describe the polar effects of phenomena. </cfoutput>
    ...My content 2...
    <c:block parent=b1 count=331> Yin-yang are Mutually Rooted </c:block>
    ...My content 3...
    <font id=f21 size=12 face=Arial style="color:red"> The Yin and yang aspects are in dynamic equilibrium </font>
    ...My content 4...
Based on the example transformation rules shown above, it is transformed like this:
    ...My content 1...
    Yin and yang describe the polar effects of phenomena.
    ...My content 2...
    <div> Yin-yang are Mutually Rooted </div>
    ...My content 3...
    <span id="f21" style="color:red;font-family=Arial;font-size=12;"> The Yin and yang aspects are in dynamic equilibrium </span>
    ...My content 4...
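To make the attribute template concrete, here is a small sketch in plain Java that evaluates a ${name} template against a source tag's attributes. The class and method names are hypothetical, and replacing a missing attribute with an empty string is an assumption of this sketch rather than documented HtmlCleaner behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeTemplateSketch {

    // Matches placeholders of the form ${attributeName}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} with the value of that attribute on the source tag.
    // Assumption: an attribute that is absent contributes an empty string.
    static String evaluate(String template, Map<String, String> sourceAttributes) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String value = sourceAttributes.getOrDefault(matcher.group(1), "");
            matcher.appendReplacement(result, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // Attributes of the <font> tag from the example markup.
        Map<String, String> font = new LinkedHashMap<>();
        font.put("id", "f21");
        font.put("size", "12");
        font.put("face", "Arial");
        font.put("style", "color:red");

        String template = "${style};font-family=${face};font-size=${size};";
        // prints "color:red;font-family=Arial;font-size=12;"
        System.out.println(evaluate(template, font));
    }
}
```

The printed value matches the style attribute of the span tag in the transformed markup.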
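public void setCleanerTransformationStringArray(String[] transInfos)
Sets an array of parsing transformation strings. See getCleanerTransformationStringArray() for details.
Parameters: transInfos - the array of parsing transformation strings
public List<TagNodeVisitor> getTagNodeVisitors()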
Returns the custom TagNodeVisitors that can perform custom processing on the selected tag nodes.
public void setTagNodeVisitors(List<TagNodeVisitor> tagNodeVisitors)
Sets the custom TagNodeVisitors that can perform custom processing on the selected tag nodes.
public void addTagNodeVisitor(TagNodeVisitor tagNodeVisitor)
Adds a custom TagNodeVisitor that can perform custom processing on the selected tag nodes.
public void removeTagNodeVisitor(TagNodeVisitor tagNodeVisitor)
Removes the specified custom TagNodeVisitor.
public void removeAllTagNodeVisitors()
Removes all custom TagNodeVisitors.
public void rewrite(Source source, Sink sink, ContentRewritingContext context) throws ContentRewritingException, IOException
Reads content from the source, transforms the content and writes it to the sink.
This method gets a Reader from the source and cleans its content with the HtmlCleaner returned by getHtmlCleaner().
It then checks whether an XPath expression is configured by calling getXpathExpression(). If one is configured, it selects only the element(s) matching the XPath expression. Otherwise, it selects the root tag node by default.
Next, it checks whether any TagNodeVisitors are set by calling getTagNodeVisitors(). If there are any, it invokes TagNodeVisitor.visit(TagNode, org.htmlcleaner.HtmlNode) so that each custom TagNodeVisitor can do its own tag node handling on either the filtered tag node(s) or, if no XPath expression is configured, the root tag node.
Finally, it serializes the selected tag node(s) to the sink by internally creating a Serializer based on the other configured properties.
Specified by: rewrite in interface ContentRewriter
Parameters:
source - source of content
sink - target of rewritten content
context - content rewriting context
Throws:
ContentRewritingException
IOException
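The select, visit and serialize steps described for rewrite can be sketched with JDK classes only. This is an illustration under stated assumptions, not the class's actual implementation. It uses the JDK's DOM parser, XPath and an identity Transformer as stand-ins for HtmlCleaner, TagNodeVisitor and the Serializer, so the input must be well-formed XML. Visitors are modeled as Consumer<Element> and are applied only to the selected elements themselves. All names below are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RewriteFlowSketch {

    static String rewrite(String markup, String xpathExpression,
                          List<Consumer<Element>> visitors, boolean innerOnly) {
        try {
            // Stand-in for cleaning: parse well-formed markup into a DOM tree.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));

            // Step 1: select the filtered elements, or the root element by default.
            List<Element> selected = new ArrayList<>();
            if (xpathExpression == null || xpathExpression.isEmpty()) {
                selected.add(doc.getDocumentElement());
            } else {
                NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                        .evaluate(xpathExpression, doc, XPathConstants.NODESET);
                for (int i = 0; i < matches.getLength(); i++) {
                    if (matches.item(i) instanceof Element) {
                        selected.add((Element) matches.item(i));
                    }
                }
            }

            // Step 2: give each visitor a chance to process each selected element.
            for (Element element : selected) {
                for (Consumer<Element> visitor : visitors) {
                    visitor.accept(element);
                }
            }

            // Step 3: serialize the selected elements, or only their children.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.INDENT, "no");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter sink = new StringWriter();
            for (Element element : selected) {
                if (innerOnly) {
                    NodeList children = element.getChildNodes();
                    for (int i = 0; i < children.getLength(); i++) {
                        serializer.transform(new DOMSource(children.item(i)), new StreamResult(sink));
                    }
                } else {
                    serializer.transform(new DOMSource(element), new StreamResult(sink));
                }
            }
            return sink.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Rewriting failed", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<html><body><div id=\"main\"><p>Hello</p></div></body></html>";
        List<Consumer<Element>> visitors = new ArrayList<>();
        visitors.add(element -> element.setAttribute("class", "visited"));
        System.out.println(rewrite(markup, "//div[@id='main']", visitors, false));
        System.out.println(rewrite(markup, "//div[@id='main']", visitors, true));
    }
}
```

In this sketch the innerOnly flag plays the role of the innerHtmlOnly property, and a null or empty xpathExpression falls back to the root element.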
protected HtmlCleaner getHtmlCleaner()
Returns the underlying HtmlCleaner instance.
If it has not been initialized yet, it creates one by invoking createHtmlCleaner().
protected HtmlCleaner createHtmlCleaner()
Creates an HtmlCleaner instance.
By default, it sets the following properties on the HtmlCleaner:
Copyright © 2008–2015 The Apache Software Foundation. All rights reserved.