This adds more definite types for our NLP tasks and tokenization configurations.
This is the first step in allowing users to more easily import their own transformer models from sources other than Hugging Face.
## Changes
### Better logging
Switched from `print` statements to `logging` for cleaner and more informative output: timestamps and log levels are now shown. Logging is a bit more verbose, but it will help users better understand what the script is doing.
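For illustration, a minimal sketch of this kind of setup (the format string and logger name are assumptions, not the script's verbatim configuration):

```python
import logging

# Show a timestamp and the log level on every line; the exact format
# string here is an assumption, not the script's actual configuration.
logging.basicConfig(
    format="%(asctime)s %(levelname)s : %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("eland_import_hub_model")

logger.info("Connecting to Elasticsearch")
```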
### Add support for ES authentication using username/password or API key
Instead of being limited to passing credentials in the URL, there are now two additional methods:
- username/password using `--es-username` and `--es-password`
- API key using `--es-api-key`
Credentials can also be specified as environment variables with `ES_USERNAME`/`ES_PASSWORD` or `ES_API_KEY`.
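A sketch of the resolution order this implies, with CLI flags taking precedence over environment variables (the helper name and the choice to prefer the API key are assumptions, not eland's actual implementation):

```python
import os

def resolve_es_credentials(args):
    """Hypothetical helper: resolve credentials from flags, then env vars."""
    username = args.es_username or os.environ.get("ES_USERNAME")
    password = args.es_password or os.environ.get("ES_PASSWORD")
    api_key = args.es_api_key or os.environ.get("ES_API_KEY")

    if api_key:
        # API key authentication
        return {"api_key": api_key}
    if username and password:
        # Basic authentication with username/password
        return {"http_auth": (username, password)}
    # Fall back to credentials embedded in the URL, if any
    return {}
```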
### Graceful handling of missing PyTorch requirements
Using the `eland_import_hub_model` script requires the PyTorch extras to be installed. If the user does not have the required packages installed, a helpful message is logged with a hint to install `eland[pytorch]` with `pip`.
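A sketch of such an import guard, assuming `eland.ml.pytorch.PyTorchModel` is the import that needs the extras (the exact log wording is an assumption):

```python
import logging
import sys

logger = logging.getLogger(__name__)

try:
    from eland.ml.pytorch import PyTorchModel  # needs the PyTorch extras
except ModuleNotFoundError as exc:
    # Log a readable hint instead of letting the ImportError propagate.
    logger.error(
        "Failed to import module '%s'. Please install the PyTorch extras "
        "first: python -m pip install 'eland[pytorch]'",
        exc.name,
    )
    sys.exit(1)
```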
### Graceful handling of already existing trained model
If a trained model with the same ID as the one we're trying to import already exists, and `--clear-previous` was not specified, we now log a clearer message about why the script can't proceed, along with a hint to use the `--clear-previous` flag.
Prior to this change, we let the API exception seep through, and the user was faced with a stack trace.
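A sketch of such a pre-flight check, assuming the trained models API is used to detect the conflict (the helper name and log wording are assumptions):

```python
import logging
import sys

from elasticsearch import Elasticsearch

logger = logging.getLogger(__name__)

def abort_if_model_exists(es: Elasticsearch, model_id: str, clear_previous: bool):
    """Hypothetical helper: fail fast with a hint instead of a stack trace."""
    resp = es.ml.get_trained_models(model_id=model_id, allow_no_match=True)
    if resp["count"] > 0 and not clear_previous:
        logger.error(
            "Trained model with id '%s' already exists. Use the "
            "--clear-previous flag if you want to overwrite it.",
            model_id,
        )
        sys.exit(1)
```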
### `tqdm` added to main dependencies
If the user doesn't have the `eland[pytorch]` extras installed, the first module reported as missing is `tqdm`. Since this module is [used directly in the eland codebase](8294224e34/eland/ml/pytorch/_pytorch_model.py (L24)), it makes sense to me to have it as part of the main set of requirements.
### Nit: Set tqdm unit to `parts` in `_pytorch_model.put_model`
The default unit is `it`, but `parts` better describes what the progress bar is tracking: uploading trained model definition parts.
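In tqdm terms the change boils down to passing `unit="parts"`; the loop below is illustrative, with placeholder chunks standing in for the real model definition parts:

```python
from tqdm.auto import tqdm

# Placeholder for the model's definition chunks; in eland these come
# from splitting the traced model into parts for upload.
model_parts = [b"chunk-0", b"chunk-1", b"chunk-2"]

# With unit="parts" the bar reads "3/3 parts" instead of the default "3/3 it".
for part in tqdm(model_parts, unit="parts", desc="Uploading model definition"):
    pass  # upload of each part would happen here
```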
We added the `pytorch` module, which is type checked, but it was not listed as such in the noxfile. This change also addresses type errors that arose after adding type checking.
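A sketch of what including the module in a nox type-checking session looks like (the session name, mypy flags, and path are assumptions):

```python
import nox

@nox.session(reuse_venv=True)
def type_check(session):
    # Install and run mypy over the newly type-checked pytorch module.
    session.install("mypy")
    session.run("mypy", "eland/ml/pytorch/")
```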
The padding parameter needs to be set on the tokenization call, not in the constructor. Furthermore, the value `True` only pads to the largest input in a batch; since we don't trace with batches, this value had no effect. The proper place to pass this parameter is the tokenization call itself, and the proper value to use is `"max_length"`, which pads the input to the maximum input size specified by the model. Although we measured no functional or performance impact from this setting, it has been suggested that this is a best practice.
See: https://huggingface.co/transformers/serialization.html#dummy-inputs-and-standard-lengths
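A sketch of the difference using the Hugging Face `transformers` API (the model name is an arbitrary example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# padding=True would only pad to the longest input in a batch, which is
# a no-op when tracing with a single example. padding="max_length" pads
# up to the model's maximum input size instead.
inputs = tokenizer(
    "This is a dummy input for tracing",
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
print(inputs["input_ids"].shape)  # (1, model_max_length)
```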
PIT + `search_after` became universally safe in Elasticsearch 7.12, which added an automatic sort tiebreaker field called `_shard_doc` when using PITs. However, we now need to do feature detection to make sure we use the previous scroll method on Elasticsearch <7.12 clusters.
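A sketch of that feature detection (the helper name is an assumption):

```python
from elasticsearch import Elasticsearch

def supports_pit_search_after(es: Elasticsearch) -> bool:
    """Hypothetical helper: True when PIT + search_after is safe to use.

    Elasticsearch 7.12 added the automatic `_shard_doc` tiebreaker for
    PITs; older clusters should fall back to the scroll API.
    """
    version_number = es.info()["version"]["number"]
    major, minor = (int(part) for part in version_number.split(".")[:2])
    return (major, minor) >= (7, 12)
```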
* Fix bugs with field mapping:
  1. If there is no permission to call `_mapping`, return a readable error (a sketch appears at the end of this section)
  2. If the index is a wildcard, fix issues with user warnings
* Fix lint issues
* Remove trailing backslashes in docs
* Remove pandas/matplotlib deprecation warning
This warning is due to a conflict between pandas and matplotlib that may be resolved in a later version. For now, ignore the warning so CI works.
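A sketch of what ignoring the warning can look like (the warning category and module filter here are assumptions about what the actual filter targets):

```python
import warnings

# Suppress the pandas/matplotlib deprecation warning so CI stays green.
# The category and module matched here are assumptions; the real filter
# should target the specific warning emitted by the conflict.
warnings.filterwarnings(
    "ignore",
    category=DeprecationWarning,
    module="matplotlib",
)
```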
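And, as referenced in the field-mapping bullet above, a sketch of turning the missing `_mapping` permission into a readable error. The helper name and message wording are assumptions; `AuthorizationException` is what elasticsearch-py 7.x raises for 403 responses:

```python
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import AuthorizationException

def get_index_mappings(es: Elasticsearch, index_pattern: str) -> dict:
    """Hypothetical helper: fetch mappings, failing with a readable error."""
    try:
        return es.indices.get_mapping(index=index_pattern)
    except AuthorizationException:
        # Replace the raw 403 stack trace with an actionable message.
        raise ValueError(
            f"Cannot read mappings for '{index_pattern}': the current user "
            "does not have permission to call the _mapping API"
        ) from None
```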