285 Commits

Author SHA1 Message Date
Valeriy Khakhutskyy
77781b90ff
[ML] Update trained model inference endpoint (#556)
The infer trained model deployment API has been deprecated, so I changed the code to use its replacement.
2023-07-11 10:55:11 +02:00
Valeriy Khakhutskyy
f38de0ed05
Fix failing unit tests (#558)
I updated the tree serialization format for the new scikit-learn versions. I also raised the minimum scikit-learn requirement to 1.3 to ensure compatibility.

Fixes #555
2023-07-10 15:15:58 +02:00
Youhei Sakurai
55967a7324
Minimize if main section (#554)
To migrate from scripts to console_scripts in setup.py,
the current long if __name__ == "__main__": section is a
blocker because console_scripts requires a function to be
specified as the entry point.
Move the logic into a main() function.
2023-07-05 10:49:16 +01:00
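The refactoring described in #554 can be sketched roughly as follows. This is a minimal illustration, not the actual eland script: the `--name` option and the greeting are hypothetical placeholders, and the `entry_points` line in the comment only shows the general shape that setuptools expects.

```python
import argparse
from typing import Optional, Sequence


def main(argv: Optional[Sequence[str]] = None) -> int:
    """All logic that used to live under the __main__ guard goes here."""
    parser = argparse.ArgumentParser(description="Hypothetical example script")
    parser.add_argument("--name", default="world")  # placeholder option
    args = parser.parse_args(argv)
    print(f"hello, {args.name}")
    return 0


# The guard shrinks to a single call, so setup.py can point at the same
# function, for example:
#   entry_points={"console_scripts": ["my-script=my_package.my_module:main"]}
if __name__ == "__main__":
    main()
```

Accepting an optional `argv` keeps the function testable without touching `sys.argv`.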
Dai Sugimori
bf3b092ed4
Add BertJapaneseTokenizer support with bert_ja tokenization configuration (#534)
See elasticsearch#95546
2023-06-23 08:14:27 +01:00
Benjamin Trent
8b327f60b8
[ML] add ability to upload xlm-roberta tokenized models (#518)
This allows XLMRoberta models to be uploaded to Elasticsearch.

blocked by: elastic/elasticsearch#94089
2023-06-14 07:59:28 -04:00
David Kyle
68a22a8001
Default the optional es_version parameter (#545) 2023-06-07 12:34:53 +01:00
David Kyle
32ab988eb6
Tolerate different model output formats when measuring embedding size (#535)
Only add the embedding_size config option if the target Elasticsearch 
cluster version supports it
2023-05-25 12:25:31 -05:00
David Kyle
1e6f48f8f4
Generate valid NLP model id from file path (#541)
The eland_import_hub_model script supports uploading a local file, in which
case the --hub-model-id argument is a file path. If the --es-model-id option is
not used, the model ID is generated from the hub model ID; when that
is a file path, the path must be converted to a valid Elasticsearch model ID.
2023-05-22 15:37:36 +01:00
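One way such a conversion could look is sketched below. It is not the code from #541: the helper name `model_id_from_path` is hypothetical, and the rules it applies (lowercase alphanumerics, hyphens, underscores and dots; alphanumeric first and last character; at most 64 characters) are assumptions about what Elasticsearch accepts.

```python
import re


def model_id_from_path(path: str, max_length: int = 64) -> str:
    """Derive a model ID from a file path (hypothetical helper)."""
    # Lowercase, then replace every run of disallowed characters
    # (path separators, spaces, ...) with a double underscore.
    model_id = re.sub(r"[^a-z0-9_.-]+", "__", path.lower())
    # Assumed rule: the ID must start and end with an alphanumeric character.
    model_id = model_id.strip("_.-")[:max_length].rstrip("_.-")
    if not model_id:
        raise ValueError(f"cannot derive a model ID from {path!r}")
    return model_id


print(model_id_from_path("/home/user/My Model/"))  # home__user__my__model
```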
Seth Michael Larson
f7ea3bd476
Add a compatibility layer for Elasticsearch server 8.5.0 field_caps API 2023-05-02 15:40:20 -05:00
David Kyle
50d301f7cb
Set embedding_size config parameter for Text Embedding models (#532) 2023-04-25 11:41:14 +01:00
David Kyle
940f2a9bad
[NLP] Add support for the pass_through task #526 2023-04-06 15:43:00 +01:00
David Kyle
8e0d897171
[NLP] Prevent TypeError with None check (#525) 2023-04-03 14:56:19 +01:00
Seth Michael Larson
44e04b4905
Release v8.7.0 2023-03-30 14:00:02 -05:00
David Kyle
7f4687c791
[ML] Text expansion model config support (#520) 2023-03-08 15:40:14 +00:00
Benjamin Trent
d5578637cb
Choose text_embedding from auto when task type is unknown but it's a sentence-transformers model (#516)
closes https://github.com/elastic/eland/issues/514
2023-02-09 12:50:30 -05:00
Valeriy Khakhutskyy
0576114a1d
[ML] Export ML model as sklearn Pipeline (#509)
Closes #503

Note: I also had to pin the Sphinx version to 5.3.0 since, starting from 6.0, Sphinx suffers from a TypeError bug, which causes a CI failure.
2023-02-01 16:17:06 +01:00
Valeriy Khakhutskyy
2ea96322b3
Update to latest ES versions and fix unit tests (#512)
Update the test matrix to the latest Elasticsearch versions and fix the broken unit tests on the CI.
2023-01-31 20:55:29 +01:00
David Kyle
c55516f376
Fixes for two type hinting issues 2023-01-04 09:53:09 -06:00
David Kyle
211cc2c83f
Handle OSError for missing LightGBM dependency
Co-authored-by: Seth Michael Larson <seth.larson@elastic.co>
2022-11-02 11:32:27 -05:00
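The pattern behind this fix can be sketched as below. It is an illustration, not the eland code: the helper `optional_import` is hypothetical. The reason for catching `OSError` in addition to `ImportError` is that a package which loads a native shared library at import time can fail with `OSError` when that library is missing.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the module, or None if it cannot be imported (hypothetical helper)."""
    try:
        return importlib.import_module(name)
    except (ImportError, OSError) as exc:
        # ImportError: the package is not installed.
        # OSError: the package is installed but a native library failed to load.
        logger.debug("optional dependency %s unavailable: %s", name, exc)
        return None


lightgbm = optional_import("lightgbm")  # None when LightGBM is unusable
```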
Benjamin Trent
a8c8726634
[ML] add text_similarity task support (#486)
Adds text_similarity task support. This is a cross-encoder transformer task where both sequences are given to the transformer at once.

According to 🤗 (at least as far as the cross-encoder models are concerned), this is a sequence classification task with just one classification "label". But really, it isn't labeled at all and is more akin to a regression model.

related: elastic/elasticsearch#88439
2022-08-01 09:04:34 -04:00
Seth Michael Larson
c97e69410d
Release v8.3.0 2022-07-11 13:14:13 -05:00
David Kyle
0eb36faa5b
Restrict PyTorch version not to be more advanced than that used in Elasticsearch (#479)
Elasticsearch uses v1.11 of PyTorch. Models created with the latest PyTorch 
release (v1.12) are not compatible with v1.11. This pins the PyTorch version
to 1.11 to prevent the incompatibility. The version of the Elasticsearch Python
client is now required to be >= Eland.

All users of Eland for importing NLP models should upgrade.
2022-07-07 14:56:42 +01:00
Benjamin Trent
8892f4fd64
[ML] adds new auto task type that attempts to automatically determine NLP task type from model config (#475)
For many model types, we don't need to require the task requested. We can infer the task type based on the model configuration and architecture. 

This commit makes the `task-type` parameter optional for the model upload script and adds logic for auto-detecting the task type based on the 🤗 model.
2022-06-23 08:32:23 -04:00
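The idea of inferring a task type from the model configuration can be sketched as follows. This is not the logic from #475: the function name and the suffix-to-task mapping are assumptions made for illustration, keyed on the `architectures` list found in a 🤗 model config.

```python
from typing import Optional, Sequence

# Assumed mapping from architecture-name suffix to task type (illustrative only).
ASSUMED_SUFFIX_TO_TASK = {
    "ForMaskedLM": "fill_mask",
    "ForTokenClassification": "ner",
    "ForSequenceClassification": "text_classification",
    "ForQuestionAnswering": "question_answering",
}


def guess_task_type(architectures: Sequence[str]) -> Optional[str]:
    """Return a task type for the first recognised architecture, else None."""
    for arch in architectures:
        for suffix, task in ASSUMED_SUFFIX_TO_TASK.items():
            if arch.endswith(suffix):
                return task
    return None  # caller must ask the user for an explicit task type


print(guess_task_type(["BertForMaskedLM"]))  # fill_mask
```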
David Kyle
081c8efaa0
Freeze the traced PyTorch model 2022-06-21 07:43:18 -05:00
Benjamin Trent
ec041ffdfd
[ML] ensure quantization is applied (#472) 2022-06-15 09:23:24 -04:00
Nigel Small
a4838f4d22
Ignore type checking for agg_value 2022-05-31 09:23:15 -05:00
Benjamin Trent
fa30246937
[ML] fixes decision tree classifier upload to account for probabilities (#465)
This switches our sklearn.DecisionTreeClassifier serialization logic to account for multi-valued leaves in the tree.

The key difference between our inference and DecisionTreeClassifier is that we run a softmax over the leaf values, whereas sklearn simply normalizes them.

This means that the "probabilities" we return will differ from sklearn's.
2022-05-17 08:11:20 -04:00
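The difference between the two approaches can be seen on a single leaf. The sketch below uses made-up leaf values and plain Python; it illustrates the arithmetic only and is not the serialization code from #465.

```python
import math
from typing import List


def normalize(values: List[float]) -> List[float]:
    """Divide each value by the sum, as described for sklearn above."""
    total = sum(values)
    return [v / total for v in values]


def softmax(values: List[float]) -> List[float]:
    """Exponentiate, then normalize (shifted by the max for stability)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


leaf = [3.0, 1.0]        # illustrative multi-valued leaf
print(normalize(leaf))   # [0.75, 0.25]
print(softmax(leaf))     # roughly [0.88, 0.12]
```

Both outputs sum to 1 and rank the classes the same way, but the values differ, which is why the returned "probabilities" do not match.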
Seth Michael Larson
5bbb8e484a
Release 8.2.0 2022-05-11 06:38:21 -05:00
Benjamin Trent
650e02d16e
[ML] improve general pytorch model import and add tests (#463)
This improves the user-facing functions and classes for PyTorch NLP model upload to Elasticsearch.

Previously it was difficult to wrap your own module for uploading to Elasticsearch.

This commit splits some classes out, adds new ones, and adds tests showing how to wrap some simple modules.
2022-05-05 10:50:53 -04:00
Benjamin Trent
70fadc9986
[ML] add support for question_answering NLP tasks (#457)
Adds support for `question_answering` NLP models within the pytorch model uploader.

Related: https://github.com/elastic/elasticsearch/pull/85958
2022-05-04 13:15:33 -04:00
Benjamin Trent
afe08f8107
[ML] Improve NLP model import by using nicely defined types (#459)
This adds some more definite types for our NLP tasks and tokenization configurations.

This is the first step in allowing users to more easily import their own transformer models via something other than hugging face.
2022-05-03 15:19:03 -04:00
David Olaru
fe3422100c
Hub model import script improvements (#461)
## Changes 
### Better logging
Switched from `print` statements to `logging` for a cleaner and more informative output - timestamps and log level are shown. The logging is now a bit more verbose, but it will help users to better understand what the script is doing.

### Add support for ES authentication using username/password or api key
Instead of being limited to passing credentials in the URL, there are now 2 additional methods:
- username/password using `--es-username` and `--es-password`
- API key using `--es-api-key`

Credentials can also be specified as environment variables with `ES_USERNAME`/`ES_PASSWORD` or `ES_API_KEY`
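
A rough sketch of how these options and environment variables could be wired together is shown below. It is not the script's actual code: the helper names are hypothetical, the precedence of the API key over username/password is an assumption, and the `api_key` / `basic_auth` keyword names assume the 8.x Elasticsearch Python client.

```python
import argparse
import os
from typing import Any, Dict, Mapping


def build_parser(env: Mapping[str, str] = os.environ) -> argparse.ArgumentParser:
    """Command-line flags fall back to environment variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--es-username", default=env.get("ES_USERNAME"))
    parser.add_argument("--es-password", default=env.get("ES_PASSWORD"))
    parser.add_argument("--es-api-key", default=env.get("ES_API_KEY"))
    return parser


def auth_kwargs(args: argparse.Namespace) -> Dict[str, Any]:
    """Build client keyword arguments (assumed names, see lead-in)."""
    if args.es_api_key:  # assumed precedence: API key first
        return {"api_key": args.es_api_key}
    if args.es_username and args.es_password:
        return {"basic_auth": (args.es_username, args.es_password)}
    return {}  # fall back to credentials embedded in the URL, if any
```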

### Graceful handling of missing PyTorch requirements
In order to use the `eland_import_hub_model` script, PyTorch extras are required to be installed. If the user does not have the required packages installed, a helpful message is logged with a hint to install `eland[pytorch]` with `pip`.

### Graceful handling of already existing trained model
If a trained model with the same ID as the one we're trying to import already exists, and `--clear-previous` was not specified, we now log a clearer message about why the script can't proceed along with a hint to use the `--clear-previous` flag. 

Prior to this change, we were letting the API exception seep through and the user was faced with a stack trace.

### `tqdm` added to main dependencies
If the user doesn't have `eland[pytorch]` extras installed, the first module to be reported as missing is `tqdm`. Since this module is [used in eland codebase](8294224e34/eland/ml/pytorch/_pytorch_model.py (L24)) directly, it makes sense to me to have it as part of the main set of requirements.

### Nit: Set tqdm unit to `parts` in `_pytorch_model.put_model`
The default unit is `it`, but `parts` better describes what the progress bar is tracking - uploading trained model definition parts.
2022-04-27 15:13:58 +01:00
Benjamin Trent
8294224e34
[ML] Fix XGBoost model import for xgboost>=1.6 2022-04-20 09:20:50 -05:00
Seth Michael Larson
cb839a9ac9
Release 8.1.0 2022-03-31 17:12:26 -05:00
P. Sai Vinay
76a52b7947
Add support for eland.Series.unique() 2022-03-31 08:33:15 -05:00
Benjamin Trent
15a3007288
[ML] add roberta bart transformer upload support (#443)
Related to: https://github.com/elastic/elasticsearch/pull/84777

This allows BART and RoBERTa models to be uploaded to Elasticsearch for our currently defined NLP tasks.
2022-03-14 12:26:12 -04:00
David Kyle
5678525b15
Fix mypy type errors for elasticsearch-python v8.0.0 2022-03-08 17:50:39 -06:00
Seth Michael Larson
abd05df50b
Release 8.0.0 2022-02-10 14:29:54 -06:00
Ashton Sidhu
e3bff8a623
Add option to disable schema enforcement for pandas_to_eland 2022-01-14 07:35:58 -06:00
Benjamin Trent
72856e2c3f
[ML] Add support for MPNet PyTorch models 2022-01-10 11:21:30 -06:00
Ashton Sidhu
64daa07a65
Using the 'date' field for datetime64+timezone columns 2022-01-04 22:03:49 -06:00
Florian Winkler
3db93cd789
Allow using datetime types in filters 2022-01-04 14:46:18 -06:00
Seth Michael Larson
c14bc24032
Release 8.0.0-beta1 2021-12-16 07:42:38 -06:00
Seth Michael Larson
cd0897f5d7
Add a warning when connecting to incompatible Elasticsearch versions 2021-12-15 14:08:20 -06:00
Seth Michael Larson
109387184a
Support the v8.0 Elasticsearch client 2021-12-09 15:01:26 -06:00
Seth Michael Larson
4e489de424
Bump version to 8.0.0 2021-12-02 08:41:11 -06:00
Josh Devins
5bc1a824a7
Add PyTorch modules to noxfile
We added the `pytorch` module which is type checked but was not in the
noxfile as such. This change also addresses type errors that arose after
adding type checking.
2021-11-29 08:03:25 -08:00
Josh Devins
7209f61773
Adds max_length padding to transformer tracing (#411)
The padding parameter needs to be set on the tokenization call and not
in the constructor. Furthermore, the True value will only pad to the
largest input in a batch, however we don't trace with batches so this
value had no effect. The proper place to pass this parameter is in the
tokenization call itself and the proper value to use is "max_length"
which will pad the input to the maximum input size specified by the
model. Although we measure no functional or performance impact of this
setting, it has been suggested that this is a best practice.

See: https://huggingface.co/transformers/serialization.html#dummy-inputs-and-standard-lengths
2021-11-11 13:18:55 +01:00
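The difference between the two padding values described in #411 can be illustrated with a toy function. This is plain Python written for this note, not the 🤗 tokenizer API: `pad_batch` and its parameters are hypothetical and only mimic the behaviour described in the commit message.

```python
from typing import List, Union


def pad_batch(
    batch: List[List[int]],
    padding: Union[bool, str],
    max_length: int,
    pad_id: int = 0,
) -> List[List[int]]:
    """Toy padding: True pads to the longest input, "max_length" to a fixed size."""
    if padding is True:
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        target = max_length
    else:
        raise ValueError(f"unsupported padding value: {padding!r}")
    # Truncation is out of scope here; longer inputs are returned unchanged.
    return [ids + [pad_id] * (target - len(ids)) for ids in batch]


single = [[101, 7592, 102]]
print(pad_batch(single, True, max_length=8))          # [[101, 7592, 102]]
print(pad_batch(single, "max_length", max_length=8))  # [[101, 7592, 102, 0, 0, 0, 0, 0]]
```

With a single input, padding to the longest item in the batch is a no-op, which matches the observation that the True value had no effect during tracing.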
Benjamin Trent
a3b0907c5b
[ML] Add inference results tests for PyTorch transformer models 2021-11-10 06:50:10 -06:00
Seth Michael Larson
19014f1227
Avoid DeprecationWarnings when using the new Elasticsearch client (7.15+) 2021-10-28 09:24:36 -05:00