NAME
App::Wubot::Reactor::HTMLStrip - strip HTML data from a field
VERSION
version 0.4.2
SYNOPSIS
  - name: strip HTML from 'title' field and store results in the field title_text
    plugin: HTMLStrip
    config:
      field: title
  - name: strip HTML from the title field in-situ
    plugin: HTMLStrip
    config:
      field: title
      target_field: title
DESCRIPTION
The HTMLStrip plugin uses the Perl module HTML::Strip to remove HTML from a field. The original field content is not overwritten by default. If you do not specify a 'target_field', the HTML-stripped content is stored in a newly created field that has the same name as the original field plus _text. For example, if you use the 'subject' field, the results go into 'subject_text' by default.
If you specify a 'target_field' in the config, the HTML-stripped text is stored in that field instead. To replace the contents of an existing field with the HTML-stripped content, set 'field' and 'target_field' to the same field.
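As an illustration of writing to a separately named field, a rule might look like the sketch below. The field names 'body' and 'body_plain' are hypothetical and chosen only for this example; the 'plugin', 'field' and 'target_field' keys are the ones shown in the SYNOPSIS.

```yaml
# Hypothetical rule: 'body' keeps its original HTML,
# and the stripped text is written to 'body_plain'.
- name: strip HTML from 'body' and store the result in 'body_plain'
  plugin: HTMLStrip
  config:
    field: body
    target_field: body_plain
```

With this configuration the original 'body' field is left untouched, and no 'body_text' field is created.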
HTML::Strip can leave many \xA0 (non-breaking space) characters in the text, which can be difficult to deal with, so HTMLStrip replaces each of them with a single space.
If the stripped text is flagged as utf8 (according to utf8::is_utf8), it is passed to utf8::encode().
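The two clean-up steps described above can be sketched in plain Perl. This is a minimal sketch and not the plugin's actual source. The helper name clean_stripped_text is hypothetical, the HTML removal itself (done by HTML::Strip) is left out, and it assumes that every \xA0 character is replaced by one space.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of App::Wubot: applies the clean-up
# steps described above to text that has already had its HTML removed.
sub clean_stripped_text {
    my ($text) = @_;

    # Assumption: each non-breaking space (\xA0) becomes one plain space.
    $text =~ s/\xA0/ /g;

    # If Perl has the string flagged as utf8, encode it in place
    # to a string of UTF-8 bytes.
    utf8::encode($text) if utf8::is_utf8($text);

    return $text;
}

# Sample input: an e-acute, a non-breaking space, and a character above
# 0xFF, which forces Perl to flag the string as utf8.
my $cleaned = clean_stripped_text("caf\x{e9}\xA0menu \x{263A}");
```

After the call, $cleaned is a byte string that no longer carries the utf8 flag, which makes it safe to hand to code that expects encoded bytes.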