Add more sections for elastic.co/guide

Seth Michael Larson 2021-01-12 10:26:01 -06:00 committed by GitHub
parent 421d84fd20
commit 26354622b5
6 changed files with 136 additions and 5 deletions

View File

@ -163,7 +163,7 @@ to be serialized and used as an inference model in Elasticsearch
[0 1 1 0 1 0 0 0 1 0]
# Import the model into Elasticsearch
>>> es_model = MLModel(
>>> es_model = MLModel.import_model(
es_client="localhost:9200",
model_id="xgb-classifier",
model=xgb_model,

View File

@ -0,0 +1,62 @@
[[dataframes]]
== Data Frames
`eland.DataFrame` wraps an Elasticsearch index in a Pandas-like API
and defers all processing and filtering of data to Elasticsearch
instead of your local machine. This means you can process large
amounts of data within Elasticsearch from a Jupyter Notebook
without overloading your machine.
[source,python]
-------------------------------------
>>> import eland as ed
>>> # Connect to 'flights' index via localhost Elasticsearch node
>>> df = ed.DataFrame('localhost:9200', 'flights')
# eland.DataFrame instance has the same API as pandas.DataFrame
# except all data is in Elasticsearch. See .info() memory usage.
>>> df.head()
AvgTicketPrice Cancelled ... dayOfWeek timestamp
0 841.265642 False ... 0 2018-01-01 00:00:00
1 882.982662 False ... 0 2018-01-01 18:27:00
2 190.636904 False ... 0 2018-01-01 17:11:14
3 181.694216 True ... 0 2018-01-01 10:33:28
4 730.041778 False ... 0 2018-01-01 05:13:00
[5 rows x 27 columns]
>>> df.info()
<class 'eland.dataframe.DataFrame'>
Index: 13059 entries, 0 to 13058
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AvgTicketPrice 13059 non-null float64
1 Cancelled 13059 non-null bool
2 Carrier 13059 non-null object
...
24 OriginWeather 13059 non-null object
25 dayOfWeek 13059 non-null int64
26 timestamp 13059 non-null datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 80.0 bytes
Elasticsearch storage usage: 5.043 MB
# Filtering of rows using comparisons
>>> df[(df.Carrier=="Kibana Airlines") & (df.AvgTicketPrice > 900.0) & (df.Cancelled == True)].head()
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
[5 rows x 27 columns]
# Running aggregations across an index
>>> df[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
DistanceKilometers AvgTicketPrice
sum 9.261629e+07 8.204365e+06
min 0.000000e+00 1.000205e+02
std 4.578263e+03 2.663867e+02
-------------------------------------
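Data can also be moved between pandas and Elasticsearch. The snippet
below is a minimal sketch that assumes the `ed.pandas_to_eland()` and
`ed.eland_to_pandas()` helpers (and the keyword arguments shown) are
available in your version of Eland:
[source,python]
-------------------------------------
>>> import pandas as pd
>>> import eland as ed
# Build a small pandas.DataFrame locally
>>> pd_df = pd.DataFrame({
        "carrier": ["Kibana Airlines", "ES-Air"],
        "avg_ticket_price": [841.27, 190.64],
    })
# Index it into Elasticsearch as 'example-flights'
# (es_if_exists / es_refresh are assumed keyword arguments)
>>> ed_df = ed.pandas_to_eland(
        pd_df,
        es_client="localhost:9200",
        es_dest_index="example-flights",
        es_if_exists="replace",
        es_refresh=True,
    )
# Pull the Elasticsearch-backed frame back into a pandas.DataFrame
>>> pd_df_roundtrip = ed.eland_to_pandas(ed_df)
-------------------------------------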

View File

@ -1,4 +1,4 @@
= eland
= Eland
:doctype: book
@ -7,3 +7,7 @@ include::{asciidoc-dir}/../../shared/attributes.asciidoc[]
include::overview.asciidoc[]
include::installation.asciidoc[]
include::dataframes.asciidoc[]
include::machine-learning.asciidoc[]

View File

@ -0,0 +1,31 @@
[[machine-learning]]
== Machine Learning
Eland allows transforming trained models from the scikit-learn, XGBoost,
and LightGBM libraries so they can be serialized and used as inference
models in Elasticsearch.
[source,python]
------------------------
>>> from xgboost import XGBClassifier
>>> from eland.ml import MLModel
# Train and exercise an XGBoost ML model locally
>>> xgb_model = XGBClassifier(booster="gbtree")
>>> xgb_model.fit(training_data[0], training_data[1])
>>> xgb_model.predict(training_data[0])
[0 1 1 0 1 0 0 0 1 0]
# Import the model into Elasticsearch
>>> es_model = MLModel.import_model(
es_client="localhost:9200",
model_id="xgb-classifier",
model=xgb_model,
feature_names=["f0", "f1", "f2", "f3", "f4"],
)
# Exercise the ML model in Elasticsearch with the training data
>>> es_model.predict(training_data[0])
[0 1 1 0 1 0 0 0 1 0]
------------------------
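The same `MLModel.import_model()` call accepts scikit-learn and
LightGBM models as well. A minimal sketch for a scikit-learn
classifier trained on made-up data:
[source,python]
------------------------
>>> from sklearn.tree import DecisionTreeClassifier
>>> from eland.ml import MLModel
# Train a small scikit-learn model locally on toy data
>>> X = [[0, 0], [0, 1], [1, 0], [1, 1]]
>>> y = [0, 1, 1, 0]
>>> sk_model = DecisionTreeClassifier()
>>> sk_model.fit(X, y)
# Import it into Elasticsearch under its own model ID
>>> es_model = MLModel.import_model(
        es_client="localhost:9200",
        model_id="sklearn-decision-tree",
        model=sk_model,
        feature_names=["f0", "f1"],
    )
# Predictions are now computed by the model stored in Elasticsearch
>>> es_model.predict(X)
------------------------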

View File

@ -8,8 +8,8 @@ Source code is available on https://github.com/elastic/eland[GitHub].
[discrete]
=== Compatibility
The library is compatible with all {es} versions since `7.6.x` but you
**have to use a matching major version**:
The library is compatible with Python 3.6 and later, and with all
{es} versions since `7.6.x`, but you **have to use a matching major version**:
The recommended way to set your requirements in your `setup.py` or
`requirements.txt` is::
@ -53,7 +53,32 @@ Create a `DataFrame` object connected to an {es} cluster running on `localhost:9
[13059 rows x 27 columns]
------------------------------------
Eland can also be used for complex queries and aggregations:
[discrete]
==== Elastic Cloud
You can also connect Eland to an Elasticsearch instance in Elastic Cloud:
[source,python]
------------------------------------
>>> import eland as ed
>>> from elasticsearch import Elasticsearch
# First instantiate an 'Elasticsearch' instance connected to Elastic Cloud
>>> es = Elasticsearch(cloud_id="...", api_key=("...", "..."))
# then wrap the client in an Eland DataFrame:
>>> df = ed.DataFrame(es, es_index_pattern="flights")
>>> df.head(5)
AvgTicketPrice Cancelled ... dayOfWeek timestamp
0 841.265642 False ... 0 2018-01-01 00:00:00
1 882.982662 False ... 0 2018-01-01 18:27:00
2 190.636904 False ... 0 2018-01-01 17:11:14
3 181.694216 True ... 0 2018-01-01 10:33:28
4 730.041778 False ... 0 2018-01-01 05:13:00
[5 rows x 27 columns]
------------------------------------
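The same pattern works with any `Elasticsearch` client instance, for
example a secured self-managed cluster. A sketch assuming HTTP basic
authentication (hostname and credentials are placeholders):
[source,python]
------------------------------------
>>> import eland as ed
>>> from elasticsearch import Elasticsearch
# Build a client for a secured, self-managed cluster
>>> es = Elasticsearch(
        "https://elasticsearch.example.com:9200",
        http_auth=("elastic", "<password>"),
    )
# Any Elasticsearch client instance can back an Eland DataFrame
>>> df = ed.DataFrame(es, es_index_pattern="flights")
------------------------------------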
Eland can be used for complex queries and aggregations:
[source,python]
------------------------------------

View File

@ -41,6 +41,15 @@ Indexing, Iteration
Function Application, GroupBy & Window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. note::
Elasticsearch aggregations using cardinality (``count``) are accurate
approximations based on the `HyperLogLog++ algorithm`_, so results
may not be exact.
.. _HyperLogLog++ algorithm: https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf
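For example, distinct counts are served by a cardinality aggregation.
A minimal sketch, assuming ``nunique()`` is available for the
connected index:
.. code-block:: python
    >>> import eland as ed
    >>> df = ed.DataFrame("localhost:9200", "flights")
    # The distinct count below maps to a cardinality aggregation,
    # so the result is approximate for high-cardinality fields
    >>> df["Carrier"].nunique()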
.. autosummary::
:toctree: api/