NAME

Text::Summarizer - Summarize Bodies of Text

SYNOPSIS

use Text::Summarizer;

# all constructor arguments shown are OPTIONAL and reflect the DEFAULT VALUES of each attribute
$summarizer = Text::Summarizer->new(
	articles_path  => 'articles/*',
	permanent_path => 'data/permanent.stop',
	stopwords_path => 'data/stopwords.stop',
	store_scanner  => 0,
	print_scanner  => 0,
	print_summary  => 0,
	return_count   => 20,
	phrase_thresh  => 2,
	phrase_radius  => 5,
	freq_constant  => 0.004,
);

$summarizer = Text::Summarizer->new();
	# to summarize a string
$stopwords = $summarizer->scan_text( 'this is a sample text' );
$summary   = $summarizer->summ_text( 'this is a sample text' );
    # or to summarize an entire file
$stopwords = $summarizer->scan_file("some/file.txt");
$summary   = $summarizer->summ_file("some/file.txt");
	# or to summarize in bulk
@stopwords = $summarizer->scan_each("/directory/glob/*");  # if no argument provided, defaults to the 'articles_path' attribute
@summaries = $summarizer->summ_each("/directory/glob/*");  # if no argument provided, defaults to the 'articles_path' attribute

DESCRIPTION

This module allows you to summarize bodies of text into a scored hash of sentences, phrase-fragments, and individual words drawn from the provided text. These scores reflect the weight (or precedence) of each text-fragment, i.e. how well it summarizes or reflects the overall nature of the text. All of the sentences and phrase-fragments are drawn from within the existing text, and are NOT procedurally generated.

ATTRIBUTES

The constructor attributes shown in the SYNOPSIS can be accessed at any time via their accessor methods ($summarizer->attribute). Read-write attributes can also be modified through the same accessors, while read-only attributes can only be read, not set.

FUNCTIONS

scan

Scan is a utility that allows Text::Summarizer to parse through a body of text and find words that occur with unusually high frequency. These words are then stored as new stopwords in the file at the provided stopwords_path. Additionally, calling any of the three scan_[...] subroutines returns a reference (or array of references) to an unordered list containing the new stopwords.

$stopwords = $summarizer->scan_text( 'this is a sample text' );
$stopwords = $summarizer->scan_file( 'some/file/path.txt' );
@stopwords = $summarizer->scan_each( 'some/directory/*' );  # if no argument provided, defaults to the 'articles_path' attribute

summarize

Summarizing is, not surprisingly, the heart of Text::Summarizer. Summarizing a body of text provides three distinct categories of information drawn from the existing text and ordered by relevance to the summary: full sentences, phrase-fragments / context-free token streams, and a list of frequently occurring words.

There are three provided functions for summarizing text documents:

$summary   = $summarizer->summarize_text( 'this is a sample text' );
$summary   = $summarizer->summarize_file( 'some/file/path.txt' );
@summaries = $summarizer->summarize_each( 'some/directory/*' );  # if no argument provided, defaults to the 'articles_path' attribute
	# or their short forms
$summary   = $summarizer->summ_text('...');
$summary   = $summarizer->summ_file('...');
@summaries = $summarizer->summ_each('...');  # if no argument provided, defaults to the 'articles_path' attribute

summarize_text and summarize_file each return a summary hash-ref containing three array-refs, while summarize_each returns a list of these hash-refs.
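As an illustrative sketch of the shape of these hashes (the key names 'sentences', 'fragments', and 'words' are assumptions chosen to match the three categories described above, not names confirmed by the module), a returned summary hash-ref might look like:

```perl
use strict;
use warnings;

# Hypothetical structure only: the key names below are illustrative
# assumptions; consult the module's source for the actual keys.
my $summary = {
    sentences => [ 'The highest-scoring full sentence.', '...' ],
    fragments => [ '(some|word|tokens)', 'a longer scrap' ],
    words     => [ 'frequent', 'words' ],
};
```

Each array-ref is ordered by relevance, with the highest-scoring entries first.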

About Fragments

Phrase fragments are in actuality short "scraps" of text (usually only two or three words) that are derived from the text via the following process:

  1. the entirety of the text is tokenized and scored into a frequency table, with a high-pass threshold keeping only words whose frequency exceeds (number of tokens * user-defined scaling factor)
  2. each sentence is tokenized and stored in an array
  3. for each word within the frequency table, a table of phrase-fragments is derived by finding each occurrence of said word and tracking forward and backward by a user-defined "radius" of tokens (defaults to radius = 5, not including the central key-word); each phrase-fragment thus comprises (by default) an 11-token string
  4. all fragments for a given key-word are then compared to each other, and each word is deleted if it appears only once amongst all of the fragments (leaving only the words shared between two or more of the phrase-fragments)
  5. what remains of each fragment is a list of "scraps" — strings of consecutive tokens — from which the longest scrap is chosen as a representation of the given phrase-fragment
  6. when a shorter fragment-scrap (A) is included in the text of a longer scrap (B) such that A ⊆ B, the shorter is deleted and its score is added to that of the longer
  7. when multiple fragments are equivalent (i.e. they consist of the same list of tokens when stopwords are excluded), they are condensed into a single scrap in the form of "(some|word|tokens)" such that the fragment now represents the tokens of the scrap (excluding stopwords) regardless of order (referred to as a "context-free token stream")
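Steps 1 and 3 above can be sketched in plain Perl. This is a minimal illustration using the documented defaults (phrase_radius = 5, freq_constant = 0.004); the fragments_for helper is a hypothetical name for illustration, not part of the module:

```perl
use strict;
use warnings;

# Sketch of steps 1 and 3: tokenize, build a frequency table with a
# high-pass threshold, then collect a window of tokens around each
# occurrence of every high-frequency word.
sub fragments_for {
    my ($text, $radius, $freq_constant) = @_;
    $radius        //= 5;      # default phrase_radius
    $freq_constant //= 0.004;  # default freq_constant

    my @tokens = grep { length } split /\W+/, lc $text;

    # step 1: keep only words whose frequency exceeds
    # (number of tokens * scaling factor)
    my %freq;
    $freq{$_}++ for @tokens;
    my $threshold = @tokens * $freq_constant;
    my @keywords  = grep { $freq{$_} > $threshold } keys %freq;

    # step 3: track $radius tokens backward and forward from each
    # occurrence of each key-word (an 11-token window by default)
    my %fragments;
    for my $word (@keywords) {
        for my $i (grep { $tokens[$_] eq $word } 0 .. $#tokens) {
            my $lo = $i - $radius < 0        ? 0        : $i - $radius;
            my $hi = $i + $radius > $#tokens ? $#tokens : $i + $radius;
            push @{ $fragments{$word} }, [ @tokens[$lo .. $hi] ];
        }
    }
    return \%fragments;
}
```

Note that with a very small input every word clears the high-pass threshold; on realistic texts only the unusually frequent words survive step 1.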

SUPPORT

Bugs should always be submitted via the project's bug tracker:

https://github.com/faelin/text-summarizer/issues

For other issues, contact the maintainer.

AUTHOR

Faelin Landy faelin.landy@gmail.com (current maintainer)

CONTRIBUTORS

COPYRIGHT AND LICENSE

Copyright (C) 2018 by the AUTHOR as listed above

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.