NAME

App::lcpan::Manual::Internals - App::lcpan internals

VERSION

version 1.054.000

INDEXING

Indexing is done in at least 3 passes. The first pass is the most important. You can skip some passes if you don't need the information a pass gathers.

First pass

First, we parse authors/01mailrc.txt.gz and insert the data into the author table. Some DarkPANs, like those produced by OrePAN, have authors/00whois.xml instead. Basically, we are just assigning numeric IDs to PAUSE (author) IDs.
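As a rough illustration of this step, the sketch below reads a few lines in the 01mailrc.txt "alias" format and hands out numeric IDs. The %author hash is only a stand-in for the author table, the SOMEAUTHOR entry is made-up sample data, and none of the names are taken from lcpan's actual code.

```perl
use strict;
use warnings;

# Sample lines in the 01mailrc.txt format. The second entry is made up.
my @lines = (
    'alias PERLANCAR "perlancar <perlancar@cpan.org>"',
    'alias SOMEAUTHOR "Some Author <someauthor@example.com>"',
);

my %author;       # PAUSE ID => record (stand-in for the author table)
my $next_id = 1;

for my $line (@lines) {
    # alias PAUSEID "Full Name <email>"
    next unless $line =~ /^alias\s+(\S+)\s+"(.*?)\s*<([^>]*)>"/;
    my ($cpanid, $fullname, $email) = ($1, $2, $3);
    $author{$cpanid} //= {
        id       => $next_id++,
        fullname => $fullname,
        email    => $email,
    };
}

printf "%-12s id=%d\n", $_, $author{$_}{id} for sort keys %author;
```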

Then we parse modules/02packages.details.txt.gz, which is the main meat of the CPAN index. This file links package (module) names to release tarballs. A snippet from the file:

...
Log::ger                          0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz
Log::ger::App                     0.014  P/PE/PERLANCAR/Log-ger-App-0.014.tar.gz
Log::ger::DBI::Query              0.001  P/PE/PERLANCAR/Log-ger-DBI-Query-0.001.tar.gz
Log::ger::Filter                  0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz
Log::ger::Filter::Code            0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz
...

We insert these records into the file table (so each release file gets a numeric file ID) and the module table (so each module gets a numeric module ID as well as a link to its file ID).
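The mapping described above can be sketched as follows. This is a minimal illustration, not lcpan's code: the %file_id and %module hashes are stand-ins for the file and module tables, and the header block that precedes the package lines in the real file is not handled here.

```perl
use strict;
use warnings;

# A few package lines as they appear in 02packages.details.txt.
my @lines = (
    'Log::ger                          0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz',
    'Log::ger::App                     0.014  P/PE/PERLANCAR/Log-ger-App-0.014.tar.gz',
    'Log::ger::Filter                  0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz',
);

my %file_id;   # release path => numeric file ID (stand-in for the file table)
my %module;    # module name  => record (stand-in for the module table)
my ($next_file_id, $next_module_id) = (1, 1);

for my $line (@lines) {
    # Each line has three whitespace-separated columns.
    my ($name, $version, $path) = split ' ', $line;
    next unless defined $path;

    # Several modules can share one release file, so reuse its ID.
    $file_id{$path} //= $next_file_id++;

    $module{$name} = {
        id      => $next_module_id++,
        version => $version,
        file_id => $file_id{$path},
    };
}

printf "%-20s module_id=%d file_id=%d\n",
    $_, $module{$_}{id}, $module{$_}{file_id}
    for sort keys %module;
```

Note that Log::ger and Log::ger::Filter end up with the same file ID, because both are shipped in the same tarball.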

At this point, we have not determined distribution names yet, because that requires information from META.json or META.yml inside the release files.

Then we start to examine the release files. First, we list the contents of each release file and store the results in the content table.

We also populate the script table heuristically, by including content whose name looks like a script, e.g.:

script/foo
bin/whatever
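A name-based check of this kind might look like the sketch below. The function looks_like_script is hypothetical, and the exact pattern is an assumption based only on the two examples above; the heuristics lcpan really uses may accept other directory names.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a path inside a release looks like a script.
# Accepts "bin/NAME" or "script/NAME", optionally under one top-level
# directory such as "Foo-Bar-0.01/".
sub looks_like_script {
    my ($path) = @_;
    return $path =~ m{^(?:[^/]+/)?(?:bin|script)/[^/]+$} ? 1 : 0;
}

for my $path (qw(script/foo bin/whatever lib/Foo/Bar.pm Foo-Bar-0.01/bin/foo)) {
    printf "%-22s %s\n", $path, looks_like_script($path) ? 'script' : 'not a script';
}
```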

We then extract the distribution metadata file (either META.json or META.yml) from each release and store the information it contains in the database. This includes the distribution name (so we populate the dist table) and the dependency information (the dep table).
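To make this concrete, here is a small sketch that decodes a META.json document with the core JSON::PP module and flattens its prereqs structure into dependency rows. The Foo-Bar distribution and its prerequisites are invented sample data, and the @deps array is just a stand-in for the dep table; how lcpan actually stores these rows is not shown here.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Invented META.json content for a hypothetical distribution.
my $meta_json = <<'JSON';
{
   "name" : "Foo-Bar",
   "version" : "0.01",
   "prereqs" : {
      "runtime" : { "requires" : { "Baz::Qux" : "0.037", "perl" : "5.010001" } },
      "test"    : { "requires" : { "Test::More" : "0" } }
   }
}
JSON

my $meta = decode_json($meta_json);

# prereqs is nested as phase => relationship => { module => version }.
my @deps;
my $prereqs = $meta->{prereqs} || {};
for my $phase (sort keys %$prereqs) {
    for my $rel (sort keys %{ $prereqs->{$phase} }) {
        my $mods = $prereqs->{$phase}{$rel};
        for my $module (sort keys %$mods) {
            push @deps, {
                phase   => $phase,
                rel     => $rel,
                module  => $module,
                version => $mods->{$module},
            };
        }
    }
}

print "dist: $meta->{name}\n";
printf "  %-8s %-9s %-12s %s\n", @{$_}{qw(phase rel module version)} for @deps;
```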

At the end of this first pass, we already have a pretty useful database, because one of the main uses of lcpan is to provide dependency information.

Second pass: POD parsing

In the second pass, we extract the module and script files inside each release file into a temporary directory, then parse their POD. This pass usually takes several times as long as the first pass. At the time of this writing (2020-04-19) on my computer, the first pass takes about 14 minutes and the second pass takes 72 minutes. A big release file that contains thousands of (mostly autogenerated) module files (yes, they exist; see Paws for example) can take 25 minutes on its own. You might want to skip those files if you do not expect to ever need to deal with the module/distribution; see the lcpan update documentation.

By parsing POD, we get the module/script abstract (stored in the module table) and mentions (i.e. a POD that links to another POD, stored in the mention table). The mentions information is mainly useful for knowing how closely related one module is to another (see the lcpan related-mods subcommand).
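The two pieces of information can be illustrated with the deliberately naive sketch below, which uses regular expressions instead of a real POD parser. The sample POD and the Foo::* names are made up, only the simple single-bracket L<...> form is handled, and this is not how lcpan itself parses POD.

```perl
use strict;
use warnings;

# Made-up POD, built line by line to keep POD directives out of column 1.
my $pod = join "\n",
    '=head1 NAME',
    '',
    'Foo::Bar - Do something with foo',
    '',
    '=head1 SEE ALSO',
    '',
    'L<Foo::Baz> and L<the Qux module|Foo::Qux/"SYNOPSIS">. Also L<https://example.com/>.',
    '',
    '=cut',
    '';

# Abstract: the text after "Name - " in the NAME section.
my ($name, $abstract) = $pod =~ /^=head1\s+NAME\s+(\S+)\s+-+\s+(.+)$/m;

# Mentions: targets of L<...> links that look like module names.
my (@mentions, %seen);
while ($pod =~ /L<([^<>]+)>/g) {
    my $target = $1;
    $target =~ s/^[^|]*\|//;            # drop optional "text|" prefix
    next if $target =~ m{^\w+://};      # skip URLs
    $target =~ s{/.*$}{};               # drop optional "/section" suffix
    next unless $target =~ /^\w+(?:::\w+)*$/;
    push @mentions, $target unless $seen{$target}++;
}

print "abstract: $abstract\n";
print "mentions: @mentions\n";
```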

Third pass: subroutine

In this pass, we try to extract subroutine names in modules. This requires the use of a source code lexer (lcpan uses Compiler::Lexer). On my computer, this pass takes another 19 minutes.
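For a feel of what this pass produces, the sketch below pulls named subroutine declarations out of a made-up source string. It swaps in a simple line-based regular expression in place of a lexer, so it is only an approximation: a pattern like this can be fooled by code inside strings, heredocs, or POD, which is the kind of problem a real lexer avoids.

```perl
use strict;
use warnings;

# Made-up module source.
my $source = join "\n",
    'package Foo::Bar;',
    'sub new { my $class = shift; bless {}, $class }',
    '# sub commented_out { }',
    'sub _private {',
    '    return 1;',
    '}',
    'my $anon = sub { 42 };',
    '1;';

my @subs;
for my $line (split /\n/, $source) {
    next if $line =~ /^\s*#/;    # skip whole-line comments
    # Only named declarations at the start of a line are picked up.
    push @subs, $1 if $line =~ /^\s*sub\s+([A-Za-z_]\w*)/;
}

print "$_\n" for @subs;
```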

AUTHOR

perlancar <perlancar@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2020 by perlancar@cpan.org.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.