73 Commits

Author SHA1 Message Date
David Kyle
f5c2dcfc9d
Remove version checks in test (#792) 2025-04-30 16:24:05 +04:00
Mark J. Hoy
ec45c395fd
add 9.0.1 for LTR rescoring (#790) 2025-04-25 08:19:23 -04:00
Mark J. Hoy
51a2b9cc19
Add 9.1.0 Snapshot to Build and Fix test_ml_model Tests to Normalized Expected Scores if Min Score is Less Than Zero (#777)
* normalized expected scores if min is < 0

* only normalize scores for ES after 8.19+ / 9.1+

* add 9.1.0 snapshot to build matrix

* get min score from booster trees

* removing typing on function definition

* properly flatten our tree leaf scores

* simplify getting min score

* debugging messages

* get all the matches in better way

* Fix model score normalization.

* lint

* lint again

* lint; correct return for bounds map/list

* revert to Aurelian's fix

* re-lint :/

---------

Co-authored-by: Aurelien FOUCRET <aurelien.foucret@elastic.co>
2025-04-23 15:53:32 +00:00
David Kyle
a9c36927f6
Fix tokeniser for DeBERTa models (#769) 2025-04-23 09:10:02 +01:00
Bart Broere
75c57b0775
Support Pandas 2 (#742)
* Fix test setup to match pandas 2.0 demands

* Use the now deprecated _append method

(Better solution might exist)

* Deal with numeric_only being removed in metrics test

* Skip mad metric for other pandas versions

* Account for differences between pandas versions in describe methods

* Run black

* Check Pandas version first

* Mirror behaviour of installed Pandas version when running value_counts

* Allow passing arguments to the individual asserters

* Fix for method _construct_axes_from_arguments no longer existing

* Skip mad metric if it does not exist

* Account for pandas 2.0 timestamp default behaviour

* Deal with empty vs other inferred data types

* Account for default datetime precision change

* Run Black

* Solution for differences in inferred_type only

* Fix csv and json issues

* Skip two doctests

* Passing a set as indexer is no longer allowed

* Don't validate output where it differs between Pandas versions in the environment

* Update test matrix and packaging metadata

* Update version of Python in the docs

* Update Python version in demo notebook

* Match noxfile

* Symmetry

* Fix trailing comma in JSON

* Revert some changes in setup.py to fix building the documentation

* Revert "Revert some changes in setup.py to fix building the documentation"

This reverts commit ea9879753129d8d8390b3cbbce57155a8b4fb346.

* Use PANDAS_VERSION from eland.common

* Still skip the doctest, but make the output pandas 2 instead of 1

* Still skip doctest, but switch to pandas 2 output

* Prepare for pandas 3

* Reference the right column

* Ignore output in tests but switch to pandas 2 output

* Add line comment about NBVAL_IGNORE_OUTPUT

* Restore missing line and add stderr cell

* Use non-private method instead

* Fix indentation and parameter issues

* If index is not specified, and pandas 1 is present, set it to True

From pandas 2 and upwards, index is set to None by default

* Run black

* Newer version of black might have different opinions?

* Add line comment

* Remove unused import

* Add reason for ignore statement

* Add reason for skip

---------

Co-authored-by: Quentin Pradet <quentin.pradet@elastic.co>
2025-02-04 17:43:43 +04:00
Valeriy Khakhutskyy
77589b26b8
Remove ML model export as sklearn Pipeline and clean up code (#744)
* Revert "[ML] Export ML model as sklearn Pipeline (#509)"

This reverts commit 0576114a1d886eafabca3191743a9bea9dc20b1a.

* Keep useful changes

* formatting

* Remove obsolete test matrix configuration and update version references in documentation and Noxfile

* formatting

---------

Co-authored-by: Quentin Pradet <quentin.pradet@elastic.co>
2025-02-04 11:36:50 +04:00
David Kyle
5253501704
Upgrade PyTorch to version 2.3.1 (#718)
Upgrades the PyTorch, transformers and sentence transformer requirements.
Elasticsearch has upgraded to PyTorch to 2.3.1 in 8.16 and 8.15.2. For 
compatibility reasons Eland will refuse to upload to an Elasticsearch cluster 
that has is using an earlier version of PyTorch.
2024-09-30 10:22:02 +01:00
David Kyle
fd8886da6a
Default truncation to second for text similarity the task type(#713)
In reranking the first input (the query) is generally shorter. In this case
it makes more sense to truncate the second input (the document text)
2024-08-05 11:47:15 +01:00
Aurélien FOUCRET
bee6d0e1f7
Remove input fields from exported LTR models (#708) 2024-07-05 14:31:22 +02:00
Bart Broere
1014ecdb39
Fix non _source fields missing from the result hits (#693) 2024-06-10 11:09:52 +04:00
Aurélien FOUCRET
9cea2385e6
Work around LTR model cache in tests (#685) 2024-04-08 14:00:36 +04:00
David Kyle
ae0bba34c6
Upgrade torch to 2.1.2 (#671)
Compatible with Elasticsearch 8.13 where the same upgrade has been made
2024-03-26 10:06:50 +00:00
David Kyle
8e8c49ddbf
Mute the Learning to Rank tests (#676) 2024-03-21 10:13:31 +00:00
David Kyle
5d34dc3cc4
Add override option to specify the model's max input size(#674)
If the max input size cannot be found in the configuration the user
can specify it as a parameter to the eland_import_hub_model script
2024-03-20 10:02:43 +00:00
Bart Broere
33cf029efe
Implement eland.DataFrame.to_json (#661)
Co-authored-by: Quentin Pradet <quentin.pradet@elastic.co>
2024-02-15 11:32:54 +04:00
Quentin Pradet
02190e74e7
Switch to 2024 black style (#657) 2024-01-31 14:47:19 +04:00
Aurélien FOUCRET
2a6a4b1f06
Fix missing value support for XGBRanker. (#654)
* Fix missing value support for XGBRanker.

* lint

* Sort expected scores

* lint
2024-01-23 18:42:24 +01:00
David Kyle
64216d44fb
Add prefix_string config option to the import model hub script (#642) 2024-01-19 12:06:57 +04:00
Aurélien FOUCRET
5169cc926a
Improve LTR (#651)
* Ensure the feature logger is using NaN for non matching query feature extractors (consistent with ES).

* Default score is None instead of 0.

* LTR model import API improvements.

* Fix feature logger tests.

* Fix export in eland.ml.ltr

* Apply suggestions from code review

Co-authored-by: Adam Demjen <demjened@gmail.com>

* Fix supported models for LTR

---------

Co-authored-by: Adam Demjen <demjened@gmail.com>
2024-01-17 13:01:47 +04:00
Aurélien FOUCRET
d3ed669a5e
LTR feature logger (#648) 2024-01-12 13:52:04 +01:00
Adam Demjen
926f0b9b5c
Add XGBRanker and transformer (#649)
* Add XGBRanker and transformer

* Map XGBoostRegressorTransformer to XGBRanker

* Add unit tests

* Remove unused import

* Revert addition of type

* Update function comment

* Distinguish objective based on model class
2024-01-11 15:48:13 -05:00
Adam Demjen
840871f9d9
Accept LTR inference config when creating model (#645)
* Support for supplying inference_config

* Fix linting errors

* Add unit test

* Add LTR type, throw exception on predict, refine test

* Add search step to LTR test

* Fix linter errors

* Update rescoring assertion in test + type defs

* Fix linting error

* Remove failing assertion
2024-01-08 09:19:03 -05:00
Aurélien FOUCRET
05c5859b8a
Adding a new movie dataset to the tests. (#646) 2024-01-04 16:14:56 +01:00
David Kyle
081250cdec
Fix failed import of ST RoBERTa models (#637)
Fixes an error uploading the sentence-transformers/all-distilroberta-v1 model
which failed with "missing 2 required positional arguments: 'token_type_ids' 
and 'position_ids'". The cause was that the tokenizer type was not recognised 
due to a typo
2023-11-21 12:53:43 +00:00
David Kyle
b689759278
Skip model config tests (#635)
For #633
2023-11-21 11:07:55 +00:00
Valeriy Khakhutskyy
6cecb454e3
[ML] Better memory estimation for NLP models (#568)
This PR adds an ability to estimate per deployment and per allocation memory usage of NLP transformer models. It uses torch.profiler and performs logs the peak memory usage during the inference.

This information is then used in Elasticsearch to provision models with sufficient memory (elastic/elasticsearch#98874).
2023-11-06 12:18:20 +01:00
Bart Broere
5e5f36bdf8
Deal with the mad aggregation being removed in Pandas 2 (#602) 2023-11-06 06:12:16 +01:00
David Kyle
5b3a83e7f2
[NLP] Support E5 small multi-lingual (#625)
Although E5 small is a BERT based model it takes 2 parameters to forward
not 4. Use the tokenizer type to decide the number of parameters
2023-10-31 17:49:43 +00:00
David Kyle
ab6e44f430
[NLP] Tests for NLP model configurations (#623)
Add tests for generated Elasticsearch model configurations
2023-10-19 12:39:57 +01:00
Bart Broere
36b941e336
Use _append instead of append since it's still available after 2.0 of pandas (#603) 2023-10-11 15:41:05 +01:00
Quentin Pradet
c6ce4b2c46
Fix direct usage of TransformerModel (#619) 2023-10-11 11:56:14 +02:00
Bart Broere
3908f43905
Remove deprecated check_less_precise (#596) 2023-09-26 07:34:52 +02:00
Enrico Zimuel
bb59a4f8d6
Fixed conf test with isinstance 2023-08-22 13:23:23 +02:00
Valeriy Khakhutskyy
f38de0ed05
Fix failing unit tests (#558)
I updated the tree serialization format for the new scikit learn versions. I also updated the minimum requirement of scikit learn to 1.3 to ensure compatibility.

Fixes #555
2023-07-10 15:15:58 +02:00
David Kyle
32ab988eb6
Tolerate different model output formats when measuring embedding size (#535)
Only add the embedding_size config option if the target Elasticsearch 
cluster version supports it
2023-05-25 12:25:31 -05:00
David Kyle
1e6f48f8f4
Generate valid NLP model id from file path (#541)
The eland_import_hub_model script supports uploading a local file where
the --hub-model-id argument is a file path. If the --es-model-id option is
not used the model Id is generated from the hub model id and when that 
is a file path the path must be converted to a valid elasticsearch model id.
2023-05-22 15:37:36 +01:00
David Kyle
36bbbe0bdb
Upgrade torch to 1.13.1 and check the cluster version before uploading a NLP model. (#522)
PyTorch models traced in version 1.13 of PyTorch cannot be evaluated in 
version 1.9 or earlier. With this upgrade Eland becomes incompatible with
pre 8.7 Elasticsearch and will refuse to upload a model to the cluster. 
In this scenario either upgrade Elasticsearch or use an earlier version of Eland.
2023-05-19 16:29:38 +01:00
Seth Michael Larson
f7ea3bd476
Add a compatibility layer for Elasticsearch server 8.5.0 field_caps API 2023-05-02 15:40:20 -05:00
David Kyle
50d301f7cb
Set embedding_size config parameter for Text Embedding models (#532) 2023-04-25 11:41:14 +01:00
David Kyle
7f4687c791
[ML] Text expansion model config support (#520) 2023-03-08 15:40:14 +00:00
Benjamin Trent
d5578637cb
Choose text_embedding from auto when task type is unknown but its a sentence-transfomers model (#516)
closes https://github.com/elastic/eland/issues/514
2023-02-09 12:50:30 -05:00
Valeriy Khakhutskyy
0576114a1d
[ML] Export ML model as sklearn Pipeline (#509)
Closes #503

Note: I also had to fix the Sphinx version to 5.3.0 since, starting from 6.0, Sphinx suffers from a TypeError bug, which causes a CI failure.
2023-02-01 16:17:06 +01:00
Valeriy Khakhutskyy
2ea96322b3
Update to latest ES versions and fix unit tests (#512)
Update the test matrix to the latest Elasticsearch versions and fix the broken unit tests on the CI.
2023-01-31 20:55:29 +01:00
Benjamin Trent
8892f4fd64
[ML] adds new auto task type that attempts to automatically determine NLP task type from model config (#475)
For many model types, we don't need to require the task requested. We can infer the task type based on the model configuration and architecture. 

This commit makes the `task-type` parameter optional for the model up load script and adds logic for auto-detecting the task type based on the 🤗 model.
2022-06-23 08:32:23 -04:00
Benjamin Trent
fa30246937
[ML] fixes decision tree classifier upload to account for probabilities (#465)
This switches our sklearn.DecisionTreeClassifier serialization logic to account for multi-valued leaves in the tree.

The key difference between our inference and DecisionTreeClassifier, is that we run a softMax over the leaf where sklearn simply normalizes the results.

This means that our "probabilities" returned will be different than sklearn.
2022-05-17 08:11:20 -04:00
Benjamin Trent
650e02d16e
[ML] improve general pytorch model import and add tests (#463)
This improves the user consumed functions and classes for PyTorch NLP model upload to Elasticsearch.

Previously it was difficult to wrap your own module for uploading to Elasticsearch.

This commit splits some classes out, adds new ones, and adds tests showing how to wrap some simple modules.
2022-05-05 10:50:53 -04:00
Benjamin Trent
afe08f8107
[ML] Improve NLP model import by using nicely defined types (#459)
This adds some more definite types for our NLP tasks and tokenization configurations.

This is the first step in allowing users to more easily import their own transformer models via something other than hugging face.
2022-05-03 15:19:03 -04:00
P. Sai Vinay
76a52b7947
Add support for eland.Series.unqiue() 2022-03-31 08:33:15 -05:00
Ashton Sidhu
e3bff8a623
Add option to disable schema enforcement for pandas_to_eland 2022-01-14 07:35:58 -06:00
Benjamin Trent
72856e2c3f
[ML] Add support for MPNet PyTorch models 2022-01-10 11:21:30 -06:00