355 Commits

Author SHA1 Message Date
Benjamin Trent
afe08f8107
[ML] Improve NLP model import by using nicely defined types (#459)
This adds some more definite types for our NLP tasks and tokenization configurations.

This is the first step in allowing users to more easily import their own transformer models via something other than hugging face.
2022-05-03 15:19:03 -04:00
David Olaru
fe3422100c
Hub model import script improvements (#461)
## Changes 
### Better logging
Switched from `print` statements to `logging` for a cleaner and more informative output - timestamps and log level are shown. The logging is now a bit more verbose, but it will help users to better understand what the script is doing.

### Add support for ES authentication using username/password or api key
Instead of being limited to passing credentials in the URL, there are now 2 additional methods:
- username/password using `--es-username` and `--es-password`
- API key using `--es-api-key`

Credentials can also be specified as environment variables with `ES_USERNAME`/`ES_PASSWORD` or `ES_API_KEY`

### Graceful handling of missing PyTorch requirements
In order to use the `eland_import_hub_model` script, PyTorch extras are required to be installed. If the user does not have the required packages installed, a helpful message is logged with a hint to install `eland[pytorch]` with `pip`.

### Graceful handling of already existing trained model
If a trained model with the same ID as the one we're trying to import already exists, and `--clear-previous` was not specified, we now log a clearer message about why the script can't proceed along with a hint to use the `--clear-previous` flag. 

Prior to this change, we were letting the API exception seep through and the user was faced with a stack trace.

### `tqdm` added to main dependencies
If the user doesn't have `eland[pytorch]` extras installed, the first module to be reported as missing is `tqdm`. Since this module is [used in eland codebase](8294224e34/eland/ml/pytorch/_pytorch_model.py (L24)) directly, it makes sense to me to have it as part of the main set of requirements.

### Nit: Set tqdm unit to `parts` in `_pytorch_model.put_model`
The default unit is `it`, but `parts` better describes what the progress bar is tracking - uploading trained model definition parts.
2022-04-27 15:13:58 +01:00
Benjamin Trent
8294224e34
[ML] Fix XGBoost model import for xgboost>=1.6 2022-04-20 09:20:50 -05:00
Seth Michael Larson
cb839a9ac9
Release 8.1.0 2022-03-31 17:12:26 -05:00
P. Sai Vinay
76a52b7947
Add support for eland.Series.unqiue() 2022-03-31 08:33:15 -05:00
Benjamin Trent
15a3007288
[ML] add roberta bart transformer upload support (#443)
Related to: https://github.com/elastic/elasticsearch/pull/84777

This allows BART and RoBERTa models to be uploaded to Elasticsearch for our currently defined NLP tasks.
2022-03-14 12:26:12 -04:00
David Kyle
5678525b15
Fix mypy type errors for elasticsearch-python v8.0.0 2022-03-08 17:50:39 -06:00
Seth Michael Larson
abd05df50b
Release 8.0.0 2022-02-10 14:29:54 -06:00
Ashton Sidhu
e3bff8a623
Add option to disable schema enforcement for pandas_to_eland 2022-01-14 07:35:58 -06:00
Benjamin Trent
72856e2c3f
[ML] Add support for MPNet PyTorch models 2022-01-10 11:21:30 -06:00
Ashton Sidhu
64daa07a65
Using the 'date' field for datetime64+timezone columns 2022-01-04 22:03:49 -06:00
Florian Winkler
3db93cd789
Allow using datetime types in filters 2022-01-04 14:46:18 -06:00
Seth Michael Larson
c14bc24032
Release 8.0.0-beta1 2021-12-16 07:42:38 -06:00
Seth Michael Larson
cd0897f5d7
Add a warning when connecting to incompatible Elasticsearch versions 2021-12-15 14:08:20 -06:00
Seth Michael Larson
109387184a
Support the v8.0 Elasticsearch client 2021-12-09 15:01:26 -06:00
Seth Michael Larson
4e489de424
Bump version to 8.0.0 2021-12-02 08:41:11 -06:00
Josh Devins
5bc1a824a7
Add PyTorch modules to noxfile
We added the `pytorch` module which is type checked but was not in the
noxfile as such. This change also addresses type errors that arose after
adding type checking.
2021-11-29 08:03:25 -08:00
Josh Devins
7209f61773
Adds max_length padding to transformer tracing (#411)
The padding parameter needs to be set on the tokenization call and not
in the constructor. Furthermore, the True value will only pad to the
largest input in a batch, however we don't trace with batches so this
value had no effect. The proper place to pass this parameter is in the
tokenization call itself and the proper value to use is "max_length"
which will pad the input to the maximum input size specified by the
model. Although we measure no functional or performance impact of this
setting, it has been suggested that this is a best practice.

See: https://huggingface.co/transformers/serialization.html#dummy-inputs-and-standard-lengths
2021-11-11 13:18:55 +01:00
Benjamin Trent
a3b0907c5b
[ML] Add inference results tests for PyTorch transformer models 2021-11-10 06:50:10 -06:00
Seth Michael Larson
19014f1227
Avoid DeprecationWarnings when using the new Elasticsearch client (7.15+) 2021-10-28 09:24:36 -05:00
P. Sai Vinay
704c8982bc
Optimize to_pandas() internally to improve performance
Co-authored-by: Seth Michael Larson <seth.larson@elastic.co>
2021-10-13 13:23:04 -05:00
P. Sai Vinay
f9d2defb1b
Add number_samples to sklearn MLModel 2021-10-07 08:14:54 -05:00
Josh Devins
014943d3b8
Add initial implementation of PyTorch ML models 2021-10-06 08:44:40 -05:00
P. Sai Vinay
995f2432b6
Add number_samples to LightGBM MLModel and leaf_count to leaf nodes
* Add number_samples to lightgbm ML Model

* Add leaf_count for leaf nodes
2021-09-30 08:13:44 -05:00
P. Sai Vinay
dabb327b8b
Refactor df.info() for better readability 2021-09-28 15:12:29 -05:00
P. Sai Vinay
f241ae971a
Add flynt and --cov-report=term-missing 2021-09-21 11:18:01 -05:00
Seth Michael Larson
7aabc88e4a
Rename 'master' branch to 'main' 2021-09-08 11:51:50 -05:00
Jabin Kong
77f9a455e9
Fix docstring formatting 2021-09-07 11:40:19 -05:00
P. Sai Vinay
315f94b201
Add excluded lines for coverage and improve coverage 2021-09-07 11:39:19 -05:00
Seth Michael Larson
a50c3657c4
Release v7.14.1b1 2021-08-30 13:42:55 -05:00
Jabin Kong
1aa193da9e
Add iterrows() and itertuples() to DataFrame
Co-authored-by: Seth Michael Larson <seth.larson@elastic.co>
2021-08-20 08:34:52 -05:00
Seth Michael Larson
e4f88a34a6
Yield list of hits from _search_yield_hits() instead of individual hits 2021-08-17 12:16:10 -05:00
P. Sai Vinay
011bf29816
Simplify ES->pandas logic by removing Collectors 2021-08-16 12:22:02 -05:00
Seth Michael Larson
76d83ea47f
Bump version to 7.14.0b1 2021-08-09 09:21:49 -05:00
Seth Michael Larson
15ba8d3e02
Fallback on using scroll searches for Elasticsearch <7.12
PIT+search_after became universally safe in Elasticsearch 7.12 by adding an automatic sort tiebreaker field when using PITs called `_shard_doc` but now we need to do feature detection to make sure we use the previous scroll method on Elasticsearch <7.12 clusters
2021-08-08 12:19:41 -05:00
P. Sai Vinay
30876c8899
Switch to Point-in-Time with search_after instead of using scroll APIs
Co-authored-by: Seth Michael Larson <seth.larson@elastic.co>
2021-08-07 16:05:33 -05:00
P. Sai Vinay
d3f8d7b8f6
Optimize FieldMappings.aggregate_field_name() method 2021-08-06 11:27:59 -05:00
P. Sai Vinay
823f01cc6c
Add type hints to 'eland.operations' and 'eland.ndframe' 2021-08-02 11:50:35 -05:00
P. Sai Vinay
4c1af42c14
Add idxmax and idxmin methods to DataFrame 2021-07-28 07:55:26 -05:00
Seth Michael Larson
c74fccbd74
Drop support for Python 3.6, pandas<1.2 2021-07-27 14:43:03 -05:00
P. Sai Vinay
193bcb73ef
Add support for Pandas v1.3 and LightGBM v3.x 2021-07-27 11:01:35 -05:00
Seth Michael Larson
1555ea9534
Fix typo in version number
Should be `7.13.0b1` instead of `7.13.1b1`
2021-06-22 12:03:46 -05:00
Seth Michael Larson
16178dfb5d
Release 7.13.0b1 2021-06-22 11:59:27 -05:00
P. Sai Vinay
ac2efb5863
Optimize df.describe() to use aggregations instead of own query 2021-06-22 11:29:54 -05:00
P. Sai Vinay
5fe32a24df
Add quantile() to DataFrameGroupBy 2021-06-22 10:54:33 -05:00
P. Sai Vinay
7e8520a8ef
Remove deprecated code in XGBoost and test suite 2021-06-08 15:19:56 -05:00
P. Sai Vinay
e9c0b897f5
Add quantile() to DataFrame and Series 2021-06-08 13:02:44 -05:00
P. Sai Vinay
aa9d60e7e7
Add sort order to groupby dropna=False (#322)
* Add sort order to groupby dropna=False

* Fix rebase
2021-04-21 13:24:52 +00:00
Stephen Dodson
1040160451
Fix bugs with field mapping and lint issue (#346)
* Fix bugs with field mapping:

1. If no permission to call _mapping, return readable error
2. If index is wildcard, fix issues with user warnings

* Fixing lint issues

* Removing trailing backslashes in doc

* Remove pandas/matplotlib deprecation warning

This warning is due to a conflict between
pandas/matplotlib that may be resolved in a later
version. For now, ignore the warning so CI works.
2021-03-30 14:49:54 +00:00
Seth Michael Larson
985afe74e0
Release 7.10.1b1 2021-01-12 12:36:23 -06:00