NAME
App::lcpan::Manual::Internals - App::lcpan internals
VERSION
version 1.057.000
INDEXING
Indexing is done in several steps. The last step (parsing release files) is done in at least three passes. One or more of these passes can be skipped to save time if the information they gather is not needed.
First step: parse authors/01mailrc.txt.gz
First, we parse authors/01mailrc.txt.gz and insert the data into the author table. Some DarkPANs, like those produced by OrePAN, have authors/00whois.xml instead.
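As a rough illustration of this step, the sketch below parses the "alias" lines that make up 01mailrc.txt. It is written in Python for brevity and is not lcpan's actual code; the function name and the tuple layout (cpanid, fullname, email) are assumptions made for the example.

```python
import re

# Each line of 01mailrc.txt has the form:
#   alias PAUSEID "Full Name <email>"
ALIAS_RE = re.compile(r'^alias\s+(\S+)\s+"(.*?)\s*<([^>]*)>"\s*$')

def parse_mailrc(lines):
    """Return a list of (cpanid, fullname, email) tuples (hypothetical layout)."""
    authors = []
    for line in lines:
        m = ALIAS_RE.match(line)
        if m:  # silently ignore lines that do not look like an alias entry
            authors.append((m.group(1), m.group(2), m.group(3)))
    return authors

sample = ['alias PERLANCAR "perlancar <perlancar@cpan.org>"']
authors = parse_mailrc(sample)
```

For a real mirror, the lines would come from the gzipped file, for example via gzip.open(path, "rt"), and each tuple would become one row in the author table.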
Second step: parse modules/02packages.details.txt.gz
Then we parse modules/02packages.details.txt.gz, which is the core of the CPAN index. This file maps package (module) names to release tarballs. A snippet from the file:
...
Log::ger 0.037 P/PE/PERLANCAR/Log-ger-0.037.tar.gz
Log::ger::App 0.014 P/PE/PERLANCAR/Log-ger-App-0.014.tar.gz
Log::ger::DBI::Query 0.001 P/PE/PERLANCAR/Log-ger-DBI-Query-0.001.tar.gz
Log::ger::Filter 0.037 P/PE/PERLANCAR/Log-ger-0.037.tar.gz
Log::ger::Filter::Code 0.037 P/PE/PERLANCAR/Log-ger-0.037.tar.gz
...
We insert these records into the file table, so that each release file gets a numeric file ID, and into the module table, so that each module gets a numeric module ID as well as a link to its file ID.
At this point, we haven't parsed distribution names yet because that requires information from META.json or META.yml inside the release files.
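The ID assignment described in this step can be sketched as follows. This is an illustrative Python model, not lcpan's implementation: the function name and the in-memory dictionaries stand in for the file and module tables, and the sketch assumes the header block at the top of the file (everything up to the first blank line) has already been removed.

```python
def index_packages(lines):
    """Assign numeric IDs to release files and modules.

    Returns (files, modules): files maps release path -> file ID, and
    modules maps module name -> (module ID, version, file ID).
    Assumes header lines were stripped beforehand.
    """
    files = {}
    modules = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, version, path = parts
        # Several modules can live in one release file, so reuse its ID
        file_id = files.setdefault(path, len(files) + 1)
        modules[name] = (len(modules) + 1, version, file_id)
    return files, modules

sample = [
    "Log::ger 0.037 P/PE/PERLANCAR/Log-ger-0.037.tar.gz",
    "Log::ger::App 0.014 P/PE/PERLANCAR/Log-ger-App-0.014.tar.gz",
    "Log::ger::Filter 0.037 P/PE/PERLANCAR/Log-ger-0.037.tar.gz",
]
files, modules = index_packages(sample)
```

Note how Log::ger and Log::ger::Filter end up with the same file ID, because both are shipped in Log-ger-0.037.tar.gz.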
Third step: (release) files
Then we start to examine the release files. This is done in several passes, and you have the option to skip some of them. Multiple passes are needed because pass 2 detects links to scripts in POD, and to do that it needs the complete list of known scripts, which is collected in pass 1. In addition, some passes are more high-level, experimental, or optional.
Third step pass 1: content, scripts, distribution metadata, dependency
First we list the contents of each release archive and store the results in the content table. This allows us to check whether a distribution has a distribution metadata file (META.yml or META.json), whether a distribution contains scripts, and so on.
We populate the script table by heuristically including content whose name looks like a script, e.g.:
script/foo
bin/whatever
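A name-based heuristic of this kind could look like the Python sketch below. The exact rule lcpan applies is not spelled out here, so the pattern (a file directly under script/, scripts/ or bin/, optionally below a top-level distribution directory) is an assumption, and Foo-Bar-0.01 is a made-up distribution name.

```python
import re

# Hypothetical rule: a file directly under script/, scripts/ or bin/,
# optionally below one top-level directory, is treated as a script.
SCRIPT_RE = re.compile(r'^(?:[^/]+/)?(?:scripts?|bin)/([^/]+)$')

def script_name(path):
    """Return the script's base name, or None if path does not look like a script."""
    m = SCRIPT_RE.match(path)
    return m.group(1) if m else None

paths = [
    "Foo-Bar-0.01/script/foo",
    "Foo-Bar-0.01/bin/whatever",
    "Foo-Bar-0.01/lib/Foo/Bar.pm",
]
scripts = [script_name(p) for p in paths if script_name(p)]
```

Because the decision is made from the name alone, no archive member has to be extracted at this stage; the listing stored in the content table is enough.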
We then extract the distribution metadata files (either META.json or META.yml) and store the information they contain in the database. This includes the distribution name (written to the dist table) and the dependency information (written to the dep table).
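To make the metadata step concrete, the sketch below pulls the distribution name and a flat dependency list out of a META.json document, relying on the "name" and "prereqs" (phase, then relationship, then module to version) keys defined by the CPAN::Meta specification. It is a Python illustration only: the function name, the tuple layout, and the sample metadata for a fictional Foo-Bar distribution are assumptions, not lcpan's schema or real data.

```python
import json

def dist_and_deps(meta_json_text):
    """Return (dist name, sorted list of (phase, relationship, module, version))."""
    meta = json.loads(meta_json_text)
    deps = []
    # prereqs is nested as: phase -> relationship -> module -> version
    for phase, rels in meta.get("prereqs", {}).items():
        for rel, mods in rels.items():
            for module, version in mods.items():
                deps.append((phase, rel, module, str(version)))
    return meta.get("name"), sorted(deps)

# Made-up metadata for a fictional distribution
sample = json.dumps({
    "name": "Foo-Bar",
    "prereqs": {
        "runtime": {"requires": {"Log::ger": "0.037", "perl": "5.010001"}},
        "test": {"requires": {"Test::More": "0"}},
    },
})
name, deps = dist_and_deps(sample)
```

Each tuple corresponds to what one row of the dep table would need to record; META.yml files from the older version of the specification use a different layout and are not handled by this sketch.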
At the end of this first pass, we have a pretty useful database already. One of the main uses of lcpan is to provide dependency information. You can skip the other passes if you want.
Third step pass 2: parse POD
In the second pass, we extract the module and script files from each release file into a temporary directory, then parse their POD. This pass usually takes several times as long as the first pass. At the time of this writing (2020-04-19) on my computer, the first pass takes about 14 minutes and the second pass takes 72 minutes. A big release file that contains thousands of (mostly autogenerated) module files (yes, they exist; see Paws for example) can take 25 minutes on its own. You might want to skip such files if you never expect to need the module or distribution; see the lcpan update documentation. For example, in lcpan.conf you can put:
skip_index_file_patterns = ^Paws-\d
skip_index_file_patterns = ^Google-Ads-GoogleAds-Client-\d
skip_index_file_patterns = ^Google-Ads-AdWords-Client-\d
skip_index_file_patterns = ^eBay-API-\d
skip_index_file_patterns = ^Microsoft-AdCenter-\d
skip_index_file_patterns = ^VMOMI-\d
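The effect of these patterns can be modelled with the short Python sketch below. It assumes each pattern is a regular expression tested against the release file name without its directory part, which is what the leading ^ anchors suggest; consult the lcpan update documentation for the authoritative behavior. The function name and the AUTHOR path are hypothetical.

```python
import re

# Patterns copied from the lcpan.conf example above
skip_patterns = [
    r"^Paws-\d",
    r"^Google-Ads-GoogleAds-Client-\d",
    r"^Google-Ads-AdWords-Client-\d",
    r"^eBay-API-\d",
    r"^Microsoft-AdCenter-\d",
    r"^VMOMI-\d",
]
compiled = [re.compile(p) for p in skip_patterns]

def should_skip(release_file):
    """True if any pattern matches the file name (assumed: directory part ignored)."""
    basename = release_file.rsplit("/", 1)[-1]
    return any(rx.search(basename) for rx in compiled)

skipped = should_skip("A/AU/AUTHOR/Paws-1.00.tar.gz")          # hypothetical path
kept = not should_skip("P/PE/PERLANCAR/Log-ger-0.037.tar.gz")
```

The trailing \d in each pattern keeps the match from catching unrelated distributions whose names merely start with the same prefix, such as a hypothetical Paws-Extras release.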
By parsing POD, we get module and script abstracts (stored in the module table) and mentions (i.e. a POD that links to another POD, stored in the mention table). The mention information is mainly useful for determining how closely related one module is to another (see the lcpan related-mods subcommand).
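Collecting mentions amounts to finding POD link codes (L<...>) and keeping the page name each one points to. The Python sketch below shows the idea with a regular expression rather than a full POD parser, so it is an approximation: it does not handle nested formatting codes or the multi-bracket form L<< ... >>, and the function name is an assumption.

```python
import re

LINK_RE = re.compile(r'L<([^<>]+)>')
# perlpodspec treats a link target of this shape as a URL
URL_RE = re.compile(r'\w+:[^:\s]\S*$')

def mentions(pod_text):
    """Return the sorted set of page names linked from pod_text."""
    found = set()
    for body in LINK_RE.findall(pod_text):
        target = body.split("|", 1)[-1]   # drop an optional "text|" prefix
        if URL_RE.match(target):
            continue                      # external URL, not a POD page
        name = target.split("/", 1)[0].strip()  # drop an optional "/section"
        if name:                          # empty name means a same-page link
            found.add(name)
    return sorted(found)

sample = ("See L<Log::ger> and L<the app|Log::ger::App/SYNOPSIS>, "
          "or L<https://metacpan.org>. Also L</DESCRIPTION>.")
linked = mentions(sample)
```

Whether a name refers to a module or a script can only be decided by looking it up among the known modules and scripts, which is why the script list from pass 1 has to be complete before this pass runs.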
Third step pass 3: subroutine
In this pass, we try to extract subroutine names in modules. This requires the use of a source code lexer (lcpan uses Compiler::Lexer). On my computer, this pass takes another 19 minutes. At the time of this writing, this pass is experimental and not enabled by default.
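For a feel of what this pass produces, the sketch below extracts named subroutines from Perl source using a plain regular expression in Python. This is deliberately a simpler, swapped-in technique and not what lcpan does: a regex can be fooled by POD, heredocs, and string contents, which is the reason a real lexer such as Compiler::Lexer is used. The function name and the sample source are made up.

```python
import re

# Naive stand-in for a lexer: a line that begins with "sub NAME"
SUB_RE = re.compile(r'^[ \t]*sub[ \t]+([A-Za-z_]\w*)', re.MULTILINE)

def sub_names(perl_source):
    """Return names of subroutines declared at the start of a line."""
    return SUB_RE.findall(perl_source)

sample = """package Foo::Bar;
sub new { bless {}, shift }
sub _private { 1 }
my $anon = sub { 2 };
1;
"""
names = sub_names(sample)
```

The anonymous subroutine assigned to $anon is not reported, since it has no name and does not start its line with the sub keyword.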
AUTHOR
perlancar <perlancar@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2020 by perlancar@cpan.org.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.