Adds text_similarity task support. This is a cross-encoder transformer task where both sequences are given to the transformer at once.
According to 🤗 (or at least as far as cross-encoder models are concerned), this is a sequence classification task with a single classification "label". But really, the output isn't a label at all and is more akin to a regression score.
related: elastic/elasticsearch#88439
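For context, here is a minimal sketch of how such a cross-encoder produces a single similarity score with 🤗 transformers (the model name is only an example):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # example cross-encoder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Both sequences are encoded together and scored in one forward pass.
inputs = tokenizer(
    "How many people live in Berlin?",
    "Berlin has a population of roughly 3.6 million.",
    return_tensors="pt",
)
with torch.no_grad():
    score = model(**inputs).logits[0, 0].item()  # one unnormalized "label" score
```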
Elasticsearch uses PyTorch v1.11. Models created with the latest PyTorch
release (v1.12) are not compatible with v1.11, so this pins the PyTorch version
to 1.11 to prevent the incompatibility. The Elasticsearch Python client is now
required to be at a version >= Eland's.
All users who import NLP models with Eland should upgrade.
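A rough illustration of the kind of guard this pin implies (not Eland's actual check):

```python
import torch

# Assumption: only the 1.11.x series is compatible with the
# Elasticsearch-side PyTorch runtime.
if not torch.__version__.startswith("1.11"):
    raise ImportError(
        f"Eland requires PyTorch 1.11, found {torch.__version__}. "
        "Install it with: python -m pip install 'torch==1.11.0'"
    )
```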
For many model types, we don't need to require the user to specify the task type. We can infer it from the model configuration and architecture.
This commit makes the `task-type` parameter optional for the model upload script and adds logic for auto-detecting the task type from the 🤗 model.
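A hedged sketch of what architecture-based auto-detection can look like; the mapping below is illustrative, not Eland's full table:

```python
from transformers import AutoConfig

# Illustrative suffix -> task mapping; the real logic covers more architectures.
ARCHITECTURE_SUFFIX_TO_TASK = {
    "ForQuestionAnswering": "question_answering",
    "ForTokenClassification": "ner",
    "ForSequenceClassification": "text_classification",
    "ForMaskedLM": "fill_mask",
}

def detect_task_type(model_id: str) -> str:
    config = AutoConfig.from_pretrained(model_id)
    for architecture in config.architectures or []:
        for suffix, task in ARCHITECTURE_SUFFIX_TO_TASK.items():
            if architecture.endswith(suffix):
                return task
    raise ValueError(
        f"Unable to infer task type for {model_id}; pass --task-type explicitly"
    )
```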
This switches our sklearn.DecisionTreeClassifier serialization logic to account for multi-valued leaves in the tree.
The key difference between our inference and DecisionTreeClassifier is that we run a softmax over the leaf values, where sklearn simply normalizes them.
This means that the "probabilities" we return will differ from sklearn's.
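A toy example of the discrepancy (leaf values made up):

```python
import numpy as np

leaf_values = np.array([3.0, 1.0])  # e.g. class counts stored at one leaf

# sklearn: plain normalization of the leaf values
sklearn_probs = leaf_values / leaf_values.sum()             # [0.75, 0.25]

# Elasticsearch inference: softmax over the same values
es_probs = np.exp(leaf_values) / np.exp(leaf_values).sum()  # ~ [0.88, 0.12]
```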
This improves the user-facing functions and classes for uploading PyTorch NLP models to Elasticsearch.
Previously it was difficult to wrap your own module for uploading to Elasticsearch.
This commit splits some classes out, adds new ones, and adds tests showing how to wrap some simple modules.
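Since Elasticsearch executes TorchScript, the basic shape of such a wrapper is a module with a single traceable `forward`; a hypothetical sketch (names are illustrative, not Eland's API):

```python
import torch

class WrappedModule(torch.nn.Module):  # hypothetical wrapper
    def __init__(self, inner: torch.nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Return only the tensor that Elasticsearch inference consumes.
        return self.inner(input_ids)

# Elasticsearch runs TorchScript, so the wrapped module is traced before upload.
example_input = torch.randint(0, 100, (1, 16))
traced = torch.jit.trace(WrappedModule(torch.nn.Embedding(100, 8)), example_input)
```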
This adds some more definite types for our NLP tasks and tokenization configurations.
This is the first step in allowing users to more easily import their own transformer models via something other than Hugging Face.
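The direction is away from loose dicts toward explicit configuration classes; a purely hypothetical sketch (these names are not Eland's actual classes):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BertTokenizationConfig:  # hypothetical name
    do_lower_case: bool = True
    max_sequence_length: int = 512

@dataclass
class NerTaskConfig:  # hypothetical name
    classification_labels: List[str] = field(default_factory=list)
    tokenization: BertTokenizationConfig = field(default_factory=BertTokenizationConfig)
```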
The Cloud ID simplifies sending data to a cluster on Elastic Cloud.
With this change, the user has the option to specify a Cloud ID using the `--cloud-id` argument as an alternative to an Elasticsearch URL (the `--url` argument).
`--cloud-id` and `--url` are mutually exclusive arguments.
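In terms of the Elasticsearch Python client that the script builds internally, the two flags map roughly to the following (values are placeholders):

```python
from elasticsearch import Elasticsearch

# --cloud-id: the value comes from the Elastic Cloud console
es = Elasticsearch(cloud_id="my-deployment:...")

# --url: any reachable cluster endpoint
es = Elasticsearch("https://localhost:9200")
```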
## Changes
### Better logging
Switched from `print` statements to `logging` for cleaner and more informative output: timestamps and log levels are shown. The logging is now a bit more verbose, but it will help users better understand what the script is doing.
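A minimal sketch of the switch (the script's actual format string may differ):

```python
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s : %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)

logger.info("Establishing connection to Elasticsearch")  # instead of print(...)
```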
### Add support for ES authentication using username/password or API key
Instead of being limited to passing credentials in the URL, there are now two additional methods:
- username/password using `--es-username` and `--es-password`
- API key using `--es-api-key`
Credentials can also be specified as environment variables with `ES_USERNAME`/`ES_PASSWORD` or `ES_API_KEY`, as in the sketch below.
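One common way to wire flags and environment variables together, with the flag taking precedence (a sketch; the script's actual precedence rules may differ):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# The environment variable is the default; an explicit flag overrides it.
parser.add_argument("--es-username", default=os.environ.get("ES_USERNAME"))
parser.add_argument("--es-password", default=os.environ.get("ES_PASSWORD"))
parser.add_argument("--es-api-key", default=os.environ.get("ES_API_KEY"))
args = parser.parse_args()
```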
### Graceful handling of missing PyTorch requirements
In order to use the `eland_import_hub_model` script, the PyTorch extras must be installed. If the user does not have the required packages, a helpful message is logged with a hint to install `eland[pytorch]` with `pip`.
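The pattern is roughly an import guard at startup (a sketch; the real message wording may differ):

```python
import logging
import sys

logger = logging.getLogger(__name__)

try:
    from eland.ml.pytorch import PyTorchModel  # needs the pytorch extras
except ImportError as e:
    logger.error(
        "Failed to import PyTorch requirements (%s). "
        "Hint: python -m pip install 'eland[pytorch]'",
        e,
    )
    sys.exit(1)
```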
### Graceful handling of already existing trained model
If a trained model with the same ID as the one we're trying to import already exists, and `--clear-previous` was not specified, we now log a clearer message about why the script can't proceed along with a hint to use the `--clear-previous` flag.
Prior to this change, we were letting the API exception seep through and the user was faced with a stack trace.
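A sketch of the up-front existence check this implies (the helper name is illustrative):

```python
from elasticsearch import Elasticsearch, NotFoundError

def ensure_model_absent(es: Elasticsearch, model_id: str, clear_previous: bool) -> None:
    """Illustrative helper: fail early instead of surfacing a raw API error."""
    try:
        es.ml.get_trained_models(model_id=model_id)
    except NotFoundError:
        return  # no model with this ID yet; safe to proceed
    if not clear_previous:
        raise SystemExit(
            f"Trained model with ID '{model_id}' already exists. "
            "Use --clear-previous to overwrite it."
        )
```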
### `tqdm` added to main dependencies
If the user doesn't have the `eland[pytorch]` extras installed, the first module to be reported as missing is `tqdm`. Since this module is [used in the eland codebase](https://github.com/elastic/eland/blob/8294224e34/eland/ml/pytorch/_pytorch_model.py#L24) directly, it makes sense to me to have it as part of the main set of requirements.
### Nit: Set tqdm unit to `parts` in `_pytorch_model.put_model`
The default unit is `it`, but `parts` better describes what the progress bar is tracking - uploading trained model definition parts.
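Illustrative use (the surrounding upload loop is simplified):

```python
from tqdm import tqdm

model_definition_parts = [b"chunk-1", b"chunk-2", b"chunk-3"]  # stand-ins

for part in tqdm(model_definition_parts, unit="parts"):
    ...  # PUT each trained model definition part to Elasticsearch
```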
In preparation for an 8.0 release, this updates PyTorch NLP dependencies
to more recent and latest minor versions. Amongst other things, this
introduces a fix from transformers that is helpful for text embedding
tasks with certain DPR models.
See: https://github.com/huggingface/transformers/issues/13670
Co-authored-by: Seth Michael Larson <seth.larson@elastic.co>