602 Commits

Author SHA1 Message Date
Dai Sugimori
bf3b092ed4
Add BertJapaneseTokenizer support with bert_ja tokenization configuration (#534)
See elasticsearch#95546
2023-06-23 08:14:27 +01:00
Seth Michael Larson
5fd1221815
Fix autosummary directive by removing hack autosummaries 2023-06-15 10:50:19 -05:00
Seth Michael Larson
17c1c2e9c7
Switch to the 'Furo' Sphinx theme 2023-06-15 09:51:14 -05:00
Benjamin Trent
8b327f60b8
[ML] add ability to upload xlm-roberta tokenized models (#518)
This allows XLMRoberta models to be uploaded to Elasticsearch.

blocked by: elastic/elasticsearch#94089
2023-06-14 07:59:28 -04:00
David Kyle
68a22a8001
Default the optional es_version parameter (#545) 2023-06-07 12:34:53 +01:00
Seth Michael Larson
afc7e41d6e
Update Dockerfile base image to use newer version 2023-06-02 14:20:01 -05:00
David Kyle
32ab988eb6
Tolerate different model output formats when measuring embedding size (#535)
Only add the embedding_size config option if the target Elasticsearch 
cluster version supports it
2023-05-25 12:25:31 -05:00
David Kyle
7ca8376f68
Add Elasticsearch 8.8 snapshot to test matrix (#543)
And increase the test ES node heap size to prevent circuit 
breaker exceptions due to better memory accounting in
elastic/elasticsearch#89437.
2023-05-24 11:59:41 +01:00
István Zoltán Szabó
e0c08e42a0
[DOCS] Adds instructions on model install in air-gapped env (#542)
Co-authored-by: David Kyle <david.kyle@elastic.co>
2023-05-24 12:53:04 +02:00
David Kyle
1e6f48f8f4
Generate valid NLP model id from file path (#541)
The eland_import_hub_model script supports uploading a local file, in which
case the --hub-model-id argument is a file path. If the --es-model-id option is
not used, the model ID is generated from the hub model ID; when that is a file
path, the path must be converted to a valid Elasticsearch model ID.
2023-05-22 15:37:36 +01:00
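The path-to-ID conversion described in commit 1e6f48f8f4 could look roughly like the sketch below. This is not Eland's actual implementation: the helper name `model_id_from_path`, the allowed character set (lowercase alphanumerics, `_`, `.`, `-`, starting and ending with an alphanumeric) and the 64-character limit are all assumptions made for illustration.

```python
import re
from pathlib import PurePosixPath


def model_id_from_path(path: str, max_len: int = 64) -> str:
    """Hypothetical helper: derive a model ID from a local file path.

    Assumed rules (not taken from the commit): lowercase alphanumerics,
    '_', '.', and '-' only, alphanumeric at both ends, at most max_len chars.
    """
    # Keep only the final path component and lowercase it.
    name = PurePosixPath(path).name.lower()
    # Replace every character outside the assumed allowed set.
    name = re.sub(r"[^a-z0-9_.-]", "_", name)
    # Truncate first, then trim, so the result still ends alphanumerically.
    name = name[:max_len]
    name = re.sub(r"^[^a-z0-9]+|[^a-z0-9]+$", "", name)
    if not name:
        raise ValueError(f"cannot derive a model ID from {path!r}")
    return name


print(model_id_from_path("/models/My Model.v2/"))
```

Passing --es-model-id explicitly avoids relying on any generated ID.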
David Kyle
7820a31256
Limit NumPy to a range of versions and note why (#540) 2023-05-22 10:47:06 +01:00
David Kyle
36bbbe0bdb
Upgrade torch to 1.13.1 and check the cluster version before uploading an NLP model (#522)
PyTorch models traced in version 1.13 of PyTorch cannot be evaluated in
version 1.9 or earlier. With this upgrade, Eland becomes incompatible with
pre-8.7 Elasticsearch and will refuse to upload a model to such a cluster.
In this scenario, either upgrade Elasticsearch or use an earlier version of Eland.
2023-05-19 16:29:38 +01:00
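A minimal sketch of the kind of version gate commit 36bbbe0bdb describes, assuming the cluster version arrives as a string such as `8.8.0-SNAPSHOT`. The names `parse_version`, `check_cluster_compatible` and `MIN_VERSION` are hypothetical and are not Eland's API; only the 8.7 threshold comes from the commit message.

```python
import re

# Minimum (major, minor) cluster version, per the commit message.
MIN_VERSION = (8, 7)


def parse_version(version_string: str) -> tuple:
    """Parse 'major.minor.patch', ignoring suffixes such as '-SNAPSHOT'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version: {version_string!r}")
    return tuple(int(part) for part in match.groups())


def check_cluster_compatible(version_string: str, minimum: tuple = MIN_VERSION) -> tuple:
    """Return the parsed version, or raise if the cluster is too old."""
    version = parse_version(version_string)
    # Compare only (major, minor) against the minimum.
    if version[:2] < minimum:
        raise RuntimeError(
            f"Elasticsearch {version_string} is older than "
            f"{minimum[0]}.{minimum[1]}; refusing to upload the model"
        )
    return version
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "8.10" sorting before "8.7".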
David Kyle
b507bb6d6c
Restrict NumPy and Pandas versions (#539)
Shap is incompatible with NumPy 1.24 due to a deprecated usage becoming
an error. There is no fix in Shap yet so an earlier version of NumPy must
be used.
Pandas 2.0 was recently released; we will continue to use the latest 1.5 release
to avoid any incompatibilities.
2023-05-19 16:04:33 +01:00
Seth Michael Larson
f7ea3bd476
Add a compatibility layer for Elasticsearch server 8.5.0 field_caps API 2023-05-02 15:40:20 -05:00
Seth Michael Larson
ca0cbe94ea
Fix readthedocs with Python 3.8 2023-05-02 12:21:57 -05:00
David Kyle
50d301f7cb
Set embedding_size config parameter for Text Embedding models (#532) 2023-04-25 11:41:14 +01:00
David Kyle
940f2a9bad
[NLP] Add support for the pass_through task (#526) 2023-04-06 15:43:00 +01:00
David Kyle
8e0d897171
[NLP] Prevent TypeError with None check (#525) 2023-04-03 14:56:19 +01:00
David Roberts
cebee6406f
Include pitfall of --start in the README (#506)
Users who follow the Eland README as a guide to importing
models can easily end up seeing inexplicably poor performance
due to unknowingly running the model with one allocation and
one thread per allocation.

This change spells out the effect of `--start` and links to
alternatives that allow better use of available hardware.

Co-authored-by: David Kyle <david.kyle@elastic.co>
2023-03-30 20:28:48 +01:00
Seth Michael Larson
44e04b4905
Release v8.7.0 v8.7.0 2023-03-30 14:00:02 -05:00
David Kyle
7f4687c791
[ML] Text expansion model config support (#520) 2023-03-08 15:40:14 +00:00
Benjamin Trent
d5578637cb
Choose text_embedding from auto when task type is unknown but it's a sentence-transformers model (#516)
closes https://github.com/elastic/eland/issues/514
2023-02-09 12:50:30 -05:00
Valeriy Khakhutskyy
0576114a1d
[ML] Export ML model as sklearn Pipeline (#509)
Closes #503

Note: I also had to fix the Sphinx version to 5.3.0 since, starting from 6.0, Sphinx suffers from a TypeError bug, which causes a CI failure.
2023-02-01 16:17:06 +01:00
Valeriy Khakhutskyy
2ea96322b3
Update to latest ES versions and fix unit tests (#512)
Update the test matrix to the latest Elasticsearch versions and fix the broken unit tests on the CI.
2023-01-31 20:55:29 +01:00
David Kyle
c55516f376
Fixes for two type hinting issues 2023-01-04 09:53:09 -06:00
David Kyle
211cc2c83f
Handle OSError for missing LightGBM dependency
Co-authored-by: Seth Michael Larson <seth.larson@elastic.co>
2022-11-02 11:32:27 -05:00
Benjamin Trent
82e34dbddb
Minor formatting fix for ML docs 2022-10-20 09:47:55 -05:00
Benjamin Trent
a8c8726634
[ML] add text_similarity task support (#486)
Adds text_similarity task support. This is a cross-encoder transformer task where both sequences are given to the transformer at once.

According to 🤗 (or at least as far as the cross-encoder models are concerned) this is a sequence classification task with just one classification "label". But really, it isn't labeled at all and is more akin to a regression model.

related: elastic/elasticsearch#88439
2022-08-01 09:04:34 -04:00
Benjamin Trent
11ea68a443
Add docker steps for eland model upload (#489) 2022-07-21 15:27:19 -04:00
István Zoltán Szabó
fbb01e5698
[DOCS] Adds important note about PyTorch version compatibility. (#487) 2022-07-13 12:41:35 +02:00
Seth Michael Larson
c97e69410d
Release v8.3.0 v8.3.0 2022-07-11 13:14:13 -05:00
David Kyle
0eb36faa5b
Restrict PyTorch version not to be more advanced than that used in Elasticsearch (#479)
Elasticsearch uses v1.11 of PyTorch. Models created with the latest PyTorch 
release (v1.12) are not compatible with v1.11. This pins the PyTorch version
to 1.11 to prevent the incompatibility. The version of the Elasticsearch Python
client is now required to be >= Eland.

All users of Eland for importing NLP models should upgrade.
2022-07-07 14:56:42 +01:00
Benjamin Trent
947d4d22a9
Update python example (#477) 2022-06-28 13:01:49 -04:00
David Kyle
23706e05b8
Add more exclusions to the dockerignore file 2022-06-28 10:34:02 -05:00
Benjamin Trent
8892f4fd64
[ML] adds new auto task type that attempts to automatically determine NLP task type from model config (#475)
For many model types, we don't need to require the task requested. We can infer the task type based on the model configuration and architecture. 

This commit makes the `task-type` parameter optional for the model upload script and adds logic for auto-detecting the task type based on the 🤗 model.
2022-06-23 08:32:23 -04:00
David Kyle
8448b3ba4e
Bump minimum PyTorch version to 1.11 2022-06-21 07:43:43 -05:00
David Kyle
081c8efaa0
Freeze the traced PyTorch model 2022-06-21 07:43:18 -05:00
Benjamin Trent
ec041ffdfd
[ML] ensure quantization is applied (#472) 2022-06-15 09:23:24 -04:00
Lisa Cawley
07af00c741
[DOCS] Include missing attributes (#468)
Co-authored-by: Seth Michael Larson <seth.larson@elastic.co>
2022-05-31 15:50:11 -07:00
Seth Michael Larson
bbe7a70cb9
Also pin traitlets 2022-05-31 14:28:36 -07:00
Seth Michael Larson
14821a8b09
Remove 'numpydoc' to stop reformatting 2022-05-31 14:28:36 -07:00
Seth Michael Larson
673065ee42
Stop explicitly pulling master 2022-05-31 14:28:36 -07:00
Lisa Cawley
845c055d7c
[DOCS] Adds question_answering task type for eland_import_hub_model 2022-05-31 14:37:51 -05:00
Nigel Small
a4838f4d22
Ignore type checking for agg_value 2022-05-31 09:23:15 -05:00
Lisa Cawley
09dd56c399
Add authentication methods for import model script (#466) 2022-05-18 07:44:37 -07:00
Benjamin Trent
fa30246937
[ML] fixes decision tree classifier upload to account for probabilities (#465)
This switches our sklearn.DecisionTreeClassifier serialization logic to account for multi-valued leaves in the tree.

The key difference between our inference and DecisionTreeClassifier is that we run a softmax over the leaf values, whereas sklearn simply normalizes them.

This means that the "probabilities" we return will differ from those returned by sklearn.
2022-05-17 08:11:20 -04:00
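To illustrate the difference noted in commit fa30246937, the sketch below applies both a plain normalization and a softmax to the same hypothetical leaf. The leaf values are invented for illustration, and the exact quantities stored in a leaf are an assumption here, not something the commit states.

```python
import math


def normalize(values):
    """Divide each value by the sum (the sklearn-style behaviour described above)."""
    total = sum(values)
    return [v / total for v in values]


def softmax(values):
    """Exponentiate, then normalize; the max is subtracted for numerical stability."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical two-class leaf values.
leaf = [3.0, 1.0]
print(normalize(leaf))  # [0.75, 0.25]
print(softmax(leaf))    # roughly [0.88, 0.12]
```

Both outputs rank the classes identically here, but the probability values differ, which matters for any downstream thresholding.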
Seth Michael Larson
5bbb8e484a
Release 8.2.0 v8.2.0 2022-05-11 06:38:21 -05:00
Benjamin Trent
650e02d16e
[ML] improve general pytorch model import and add tests (#463)
This improves the user-facing functions and classes for PyTorch NLP model upload to Elasticsearch.

Previously it was difficult to wrap your own module for uploading to Elasticsearch.

This commit splits some classes out, adds new ones, and adds tests showing how to wrap some simple modules.
2022-05-05 10:50:53 -04:00
Benjamin Trent
70fadc9986
[ML] add support for question_answering NLP tasks (#457)
Adds support for `question_answering` NLP models within the pytorch model uploader.

Related: https://github.com/elastic/elasticsearch/pull/85958
2022-05-04 13:15:33 -04:00
Benjamin Trent
afe08f8107
[ML] Improve NLP model import by using nicely defined types (#459)
This adds some more definite types for our NLP tasks and tokenization configurations.

This is the first step in allowing users to more easily import their own transformer models via something other than Hugging Face.
2022-05-03 15:19:03 -04:00