NAME
App::lcpan::Manual::Internals - App::lcpan internals
VERSION
version 1.052.000
INDEXING
Indexing is done in at least 3 passes. The first pass is the most important. You can skip some passes if you don't need the information a pass gathers.
First pass
First, we parse authors/01mailrc.txt.gz and insert the data into the author table. Some DarkPANs, such as those produced by OrePAN, have authors/00whois.xml instead. Basically, we are just assigning numeric IDs to PAUSE (author) IDs.
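As a rough illustration of this step (not lcpan's actual code, which is written in Perl), the following Python sketch parses lines of the form `alias PAUSEID "Full Name <email>"`, which is the shape used by 01mailrc.txt, and lets SQLite hand out the numeric IDs. The table layout and the function names parse_mailrc and load_authors are assumptions made for the example.

```python
import re
import sqlite3

# Line shape used by authors/01mailrc.txt.gz:
#   alias PAUSEID "Full Name <email>"
ALIAS_RE = re.compile(r'^alias\s+(\S+)\s+"(.*?)\s*<([^>]*)>"\s*$')


def parse_mailrc(lines):
    """Yield (cpanid, fullname, email) tuples; lines that do not match are skipped."""
    for line in lines:
        m = ALIAS_RE.match(line)
        if m:
            yield m.group(1), m.group(2), m.group(3)


def load_authors(conn, lines):
    # Hypothetical schema; the real lcpan table layout may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS author ("
        " id INTEGER PRIMARY KEY, cpanid TEXT UNIQUE,"
        " fullname TEXT, email TEXT)"
    )
    # INTEGER PRIMARY KEY makes SQLite assign the numeric author ID.
    conn.executemany(
        "INSERT OR IGNORE INTO author (cpanid, fullname, email) VALUES (?, ?, ?)",
        parse_mailrc(lines),
    )
    conn.commit()
```

For a real mirror, the lines could come from gzip.open(path, "rt"), assuming the file decodes cleanly in the chosen encoding.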
Then we parse modules/02packages.details.txt.gz, which is the main meat of the CPAN index. This file links package (module) names to release tarballs. A snippet from the file:
...
Log::ger 0.037 P/PE/PERLANCAR/Log-ger-0.037.tar.gz
Log::ger::App 0.014 P/PE/PERLANCAR/Log-ger-App-0.014.tar.gz
Log::ger::DBI::Query 0.001 P/PE/PERLANCAR/Log-ger-DBI-Query-0.001.tar.gz
Log::ger::Filter 0.037 P/PE/PERLANCAR/Log-ger-0.037.tar.gz
Log::ger::Filter::Code 0.037 P/PE/PERLANCAR/Log-ger-0.037.tar.gz
...
We insert these records into the file table (so each release file gets a numeric file ID) and the module table (so each module gets a numeric module ID as well as a link to its file ID).
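The mapping from index lines to the two tables can be sketched as follows. This is a minimal Python illustration, not lcpan's implementation: the schema and the names parse_packages and load_packages are hypothetical. It relies on two properties of the index file: the header block ends at the first blank line, and each body line holds a package name, a version, and a path separated by whitespace.

```python
import sqlite3


def parse_packages(lines):
    """Yield (module, version, path) from 02packages.details.txt-style lines."""
    in_body = False
    for line in lines:
        if not in_body:
            # The header block ends at the first blank line.
            if not line.strip():
                in_body = True
            continue
        parts = line.split()
        if len(parts) >= 3:
            yield parts[0], parts[1], parts[2]


def load_packages(conn, lines):
    # Hypothetical schema; the real lcpan tables have more columns.
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS file (
            id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS module (
            id INTEGER PRIMARY KEY, name TEXT UNIQUE,
            version TEXT, file_id INTEGER REFERENCES file(id));
        """
    )
    for module, version, path in parse_packages(lines):
        # Many modules share one release file, so insert the file only once.
        conn.execute("INSERT OR IGNORE INTO file (name) VALUES (?)", (path,))
        (file_id,) = conn.execute(
            "SELECT id FROM file WHERE name = ?", (path,)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO module (name, version, file_id) VALUES (?, ?, ?)",
            (module, version, file_id),
        )
    conn.commit()
```

With the snippet above, Log::ger, Log::ger::Filter, and Log::ger::Filter::Code would all point at the single file row for Log-ger-0.037.tar.gz.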
At this point, we haven't parsed distribution names yet because that will need information from META.{json,yaml} inside the release files.
Then we start to examine the release files. First, we list the contents of each release file and store the results in the content table.
We also populate the script table by heuristically including content whose name looks like a script, e.g.:
script/foo
bin/whatever
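A name-based heuristic of this kind might look like the Python sketch below. The exact rules lcpan applies are not spelled out here, so the directory list and the function name looks_like_script are assumptions; the path is taken relative to the distribution's top-level directory, as in the two examples above.

```python
# Assumed set of directory names; lcpan's real heuristic may differ.
SCRIPT_DIRS = ("script", "scripts", "bin")


def looks_like_script(rel_path):
    """Guess from the name alone whether a content entry is a script.

    rel_path is relative to the distribution root, e.g. "script/foo".
    """
    parts = rel_path.split("/")
    return (
        len(parts) == 2              # directly inside the script directory
        and parts[0] in SCRIPT_DIRS
        and parts[1] != ""           # skip the directory entry itself
        and not parts[1].startswith(".")
    )
```

Because the check looks only at names, it can produce both false positives and false negatives.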
We then extract the distribution metadata file (either META.json or META.yaml) and store the information it contains in the database. This includes the distribution name (so we populate the dist table) and the dependency information (the dep table).
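To make the metadata step concrete, here is a minimal Python sketch that pulls META.json out of a .tar.gz release and flattens it into a distribution name plus dependency rows. It assumes the version 2 CPAN::Meta layout, in which prereqs is nested as phase, then relationship, then module. It does not handle META.yaml, which would need a YAML parser. The function names read_meta_json and dist_and_deps are made up for the example.

```python
import io
import json
import tarfile


def read_meta_json(fileobj):
    """Return the parsed META.json from a .tar.gz release, or None if absent."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar.getmembers():
            # META.json normally sits directly under the top-level directory.
            parts = member.name.split("/")
            if member.isfile() and len(parts) == 2 and parts[1] == "META.json":
                return json.load(tar.extractfile(member))
    return None


def dist_and_deps(meta):
    """Flatten a v2 metadata structure into (dist name, sorted dep rows)."""
    deps = []
    for phase, rels in meta.get("prereqs", {}).items():
        for rel, mods in rels.items():
            for module, version in mods.items():
                deps.append((phase, rel, module, str(version)))
    return meta["name"], sorted(deps)
```

Each tuple in the result corresponds to one candidate row for a dependency table: the phase (such as runtime or test), the relationship (such as requires), the module name, and the minimum version.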
At the end of this first pass, we already have a pretty useful database, because one of the main uses of lcpan is to provide dependency information.
Second pass: POD parsing
In the second pass, we extract the module and script files inside each release file into a temporary directory, then parse their POD. This pass usually takes several times as long as the first pass. At the time of this writing (2020-04-19), on my computer, the first pass takes about 14 minutes and the second pass takes 72 minutes. A big release file that contains thousands of (mostly autogenerated) module files (yes, they exist; see Paws for example) can take 25 minutes on its own. You might want to skip those files if you do not expect to ever need to deal with the module/distribution; see the lcpan update documentation.
By parsing POD, we get the module/script abstract (stored in the module table) and mentions (i.e. a POD that links to another POD, stored in the mention table). The mention information is mainly useful for knowing how related one module is to another (see the lcpan related-mods subcommand).
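The two pieces of information gathered here can be approximated with regular expressions, as in the Python sketch below. This is a deliberate simplification and not a real POD parser: it assumes the conventional "Name - abstract" line under =head1 NAME, and it only understands the simple forms of L<...> links (name, text|name, name/section). The function names pod_abstract and pod_mentions are hypothetical.

```python
import re

# "=head1 NAME" followed by a line such as "Foo::Bar - Do something"
NAME_RE = re.compile(
    r"^=head1[ \t]+NAME[ \t]*\n\s*(\S+)[ \t]+-+[ \t]+(.+)$", re.MULTILINE
)
LINK_RE = re.compile(r"L<([^<>]+)>")
MODULE_RE = re.compile(r"[A-Za-z_]\w*(?:::\w+)*")


def pod_abstract(pod):
    """Return the abstract from the NAME section, or None."""
    m = NAME_RE.search(pod)
    return m.group(2).strip() if m else None


def pod_mentions(pod):
    """Return the sorted set of module-like names targeted by L<...> links."""
    found = set()
    for body in LINK_RE.findall(pod):
        target = body.split("|", 1)[-1]   # drop a leading "text|"
        target = target.split("/", 1)[0]  # drop a trailing "/section"
        target = target.strip().strip('"')
        # URLs and section-only links fail this check and are ignored.
        if MODULE_RE.fullmatch(target):
            found.add(target)
    return sorted(found)
```

Counting how often two modules mention each other, or are mentioned together, is one way such data could feed a relatedness score.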
Third pass: subroutine
In this pass, we try to extract subroutine names in modules. This requires the use of a source code lexer (lcpan uses Compiler::Lexer). On my computer, this pass takes another 19 minutes.
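For intuition only, the Python sketch below finds named subroutines with a regular expression. This is a different and much cruder technique than the lexer-based approach described above: it can be fooled by here-documents, strings, and POD blocks, which is the kind of problem a real lexer avoids. The function name sub_names is hypothetical.

```python
import re

# A named sub declaration at the start of a line: "sub name ..."
SUB_RE = re.compile(r"^\s*sub\s+([A-Za-z_]\w*)", re.MULTILINE)
END_RE = re.compile(r"^__(?:END|DATA)__\s*$", re.MULTILINE)


def sub_names(perl_source):
    """Return named subs declared before any __END__ or __DATA__ marker."""
    # Everything after the marker is data or documentation, not code.
    code = END_RE.split(perl_source, maxsplit=1)[0]
    return SUB_RE.findall(code)
```

Anonymous subs such as `my $cb = sub { ... }` are not reported, because the pattern requires a name after the sub keyword and the keyword to begin the line.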
AUTHOR
perlancar <perlancar@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2020 by perlancar@cpan.org.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.