Suck::Huggingface

Clone repos from Huggingface and then download model files.

TL;DR SUMMARY

$ sudo apt-get install git
$ sudo apt-get install wget
$ sudo cpan JSON::MaybeXS
$ sudo cpan File::Valet
$ sudo cpan Time::TAI::Simple
$ bin/suck-hug

The suck-hug utility will spew usage information at you and exit. Should be fairly self-explanatory.

DESCRIPTION

I got tired of manually downloading models and datasets from Huggingface.

So, I wrote this thinger to do it for me.

Huggingface exports models as git repos, similar to GitHub, but files above a certain size (like the models themselves, which are kind of the whole point) don't actually get downloaded when you clone a model repo; git leaves small "pointer" files in their place.

They have to be downloaded separately. Sometimes this is hundreds of files. It is a huge pain in the ass.

The bin/suck-hug utility will do this for you. Give it a list of repo URLs and it will "git clone" the repos, scan them for files which need to be downloaded, and use wget to fetch them.
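
To make the mechanics concrete, here is a minimal sketch of that clone-scan-fetch approach in plain Perl. This is not the module's actual internals: the 300-byte pointer threshold mirrors the --too_big default below, and the /resolve/main/ URL pattern assumes the repo's default branch is named "main".

use strict;
use warnings;
use File::Find;

my $repo_url = 'https://huggingface.co/bigscience/bloom';
my ($dir) = $repo_url =~ m{/([^/]+)$};

system('git', 'clone', $repo_url) == 0 or die "git clone failed\n";

# Files smaller than the threshold are assumed to be git-lfs
# pointer stubs standing in for the real (large) files.
my @pointers;
find(sub {
    return if $File::Find::name =~ m{/\.git(/|$)};
    push @pointers, $File::Find::name if -f $_ && -s $_ < 300;
}, $dir);

# Fetch each real file from the repo's resolve endpoint,
# overwriting the pointer stub.
for my $path (@pointers) {
    (my $rel = $path) =~ s{^\Q$dir\E/}{};
    my $url = "$repo_url/resolve/main/$rel";
    system('wget', '-O', $path, $url) == 0 or warn "download failed: $rel\n";
}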

It really should check the download digests, but it doesn't. There is some code in place for enabling that, but I just haven't gotten around to it yet.
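
The pointer stubs actually carry everything needed for that check: a git-lfs pointer records the expected SHA-256 and size of the real file. Here is a hedged sketch of what the verification could look like (not the dormant code in the module, just the general idea):

use strict;
use warnings;
use Digest::SHA;

# A git-lfs pointer file looks like:
#   version https://git-lfs.github.com/spec/v1
#   oid sha256:<64 hex digits>
#   size <bytes>
sub verify_download {
    my ($pointer_text, $downloaded_path) = @_;
    my ($want) = $pointer_text =~ /^oid sha256:([0-9a-f]{64})/m
        or return ('ERROR', 'no sha256 oid found in pointer');
    my $got = Digest::SHA->new(256)->addfile($downloaded_path)->hexdigest;
    return $got eq $want
        ? ('OK')
        : ('ERROR', "digest mismatch: expected $want, got $got");
}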

USAGE

usage: bin/suck-hug [options] https://huggingface.co/bigscience/bloom [more urls ...]
Will clone repos and download files too big to be included in repo
General options:
  -h, --help       Show this usage and exit
  --exclude=xx,yy  Do not download external files with xx or yy in their name
  --rate-limit=#[KMG]  Set rate limiting for wget (default: unlimited)
  --sort           Order downloads by size; download smallest first.
  --too_big=###    Treat repo files smaller than this many bytes as pointers to download (default: 300)
  --retries=###    How many times to retry failed wget (default: 1000)
  --username=XXX   If Huggingface requires auth, use XXX for username.
  --password=YYY   If Huggingface requires auth, use YYY for password.
  -q               Suppress output
  -v               More verbose output
Logging options:
  --log-dir=PATH   Write logfile in this directory (default: /var/tmp)
  --logfile=PNAME  Write logfile to this exact pathname (overrides --log-dir)
  --log-level=#    Set higher for more debug logging (default: 3, max: 7)
      level 0: only log CRITICAL records
      level 1: log CRITICAL, ERROR
      level 2: log CRITICAL, ERROR, WARNING
      level 3: log CRITICAL, ERROR, WARNING, INFO
      level 4+: log CRITICAL, ERROR, WARNING, INFO, DEBUG (many debug levels)
  --no-log         Suppress logging
  --no-logfile     Do not write log to file, can still show to stderr/stdout
  --show-log       Display log messages to stderr
  --show-log-to-stdout  Display log messages to stdout
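
For example, a run that downloads smallest files first, throttles wget to 10 MB/s, and skips some quantizations might look like this (the URL and exclude patterns are just for illustration):

$ bin/suck-hug --sort --rate-limit=10M --exclude=q2,q3,q5,q8 \
    https://huggingface.co/bigscience/bloom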

What about using the module?

use Suck::Huggingface;

my $shug = Suck::Huggingface->new(%options);
my ($ok, @result) = $shug->suck($url);
if ($ok eq 'OK') {
   my ($n_dl, $n_bytes) = @result;
   print "yaay! downloaded $n_dl files summing $n_bytes bytes\n";
} else {
   print "oh noes, errors!\n", join("\n", @result), "\n";
}

That's basically it. There are no other methods intended for end-user use.

The parameters to new() are just the same as the utility's command line options, but with hyphens turned into underscores, like:

my $shug = Suck::Huggingface->new(log_level => 5, exclude => 'q2,q3,q5,q8');

LOGGING

S:H uses a stripped-down version of my structured logger. By default it will write log records to /var/tmp/suck-huggingface.log as newline-separated JSON arrays. Use json2json or similar to pretty-format them to taste:

$ tail -n 1 /var/tmp/suck-huggingface.log | json2json -l
[ 1683950423.24891, "Fri May 12 21:01:00 2023", 16334, "INFO", 3, ["SKH:E0701EB9"],
  [ ["lib/Suck/Huggingface.pm", 145, "Suck::Huggingface::info"],
    ["bin/suck-hug", 38, "Suck::Huggingface::suck"],
    ["bin/suck-hug", 25, "main"]
  ],
  "done downloading files for this repo", {"total_size_bytes": 4437545, "n_downloaded": 1}
]

Those record elements are, in order:

  - timestamp (TAI seconds, hence the Time::TAI::Simple dependency)
  - human-readable local time
  - process ID
  - log level name
  - log level number
  - list of tags identifying the logger instance
  - call stack, as [file, line, subroutine] triples
  - the log message itself
  - hash of structured data relevant to the message

Is this overkill for such a simple tool? Hell yeah, but it's what I use for everything, and it's nice for monitoring the progress of long download tasks.

OS SUPPORT

LINUX

I've tested this on Slackware 15.0 and it works fine.

It does not work on CentOS 6, because Huggingface wants newer SSL capabilities than CentOS 6's git provides.

It mostly works on CentOS 7, but CentOS 7's git downloads the large files during the clone, instead of leaving the small "pointer" files it's supposed to, so --exclude does not work as expected.

Will give it a go on Debian later.

If you try it on other distributions, please let me know how it goes.

BSD

Totally should work. Will test it eventually.

MacOSX

Might work. Might test it later.

WINDOWS

Hahahahaha good luck!

TO DO

Verify download digests. Some code is in place for enabling that, but it is not hooked up yet.

Test on more platforms: Debian, BSD, MacOSX.

CONTACT

ttk (at) ciar (dot) org

https://old.reddit.com/u/ttkciar

Libera IRC channels ##slackware-help or #perl