mirror of
https://github.com/elastic/eland.git
synced 2025-07-11 00:02:14 +08:00
271 lines
9.7 KiB
Markdown
271 lines
9.7 KiB
Markdown
_Note, this project is still very much a work in progress and in an alpha state; input and contributions welcome!_
|
||
|
||
<p align="center">
|
||
<a href="https://github.com/elastic/eland">
|
||
<img src="./docs/source/logo/eland.png" width="30%" alt="eland" />
|
||
</a>
|
||
</p>
|
||
<table>
|
||
<tr>
|
||
<td>PyPI</td>
|
||
<td>
|
||
<a href="https://pypi.org/project/eland/">
|
||
<img src="https://img.shields.io/pypi/v/eland.svg" alt="latest release"/>
|
||
</a>
|
||
</td>
|
||
</tr>
|
||
<tr>
|
||
<td>Conda Forge</td>
|
||
<td>
|
||
<a href="https://anaconda.org/conda-forge/eland">
|
||
<img src="https://img.shields.io/conda/vn/conda-forge/eland" alt="latest release"/>
|
||
</a>
|
||
</td>
|
||
</tr>
|
||
<tr>
|
||
<td>Package Status</td>
|
||
<td>
|
||
<a href="https://pypi.org/project/eland/">
|
||
<img src="https://img.shields.io/pypi/status/eland.svg" alt="status" />
|
||
</a>
|
||
</td>
|
||
</tr>
|
||
<tr>
|
||
<td>License</td>
|
||
<td>
|
||
<a href="https://github.com/elastic/eland/blob/master/LICENSE.txt">
|
||
<img src="https://img.shields.io/pypi/l/eland.svg" alt="license" />
|
||
</a>
|
||
</td>
|
||
</tr>
|
||
<tr>
|
||
<td>Build Status</td>
|
||
<td>
|
||
<a href="https://clients-ci.elastic.co/job/elastic+eland+master/">
|
||
<img src="https://clients-ci.elastic.co/buildStatus/icon?job=elastic%2Beland%2Bmaster" alt="Build Status" />
|
||
</a>
|
||
</td>
|
||
</tr>
|
||
</table>
|
||
|
||
# What is it?
|
||
|
||
eland is a Elasticsearch client Python package to analyse, explore and manipulate data that resides in Elasticsearch.
|
||
Where possible the package uses existing Python APIs and data structures to make it easy to switch between numpy,
|
||
pandas, scikit-learn to their Elasticsearch powered equivalents. In general, the data resides in Elasticsearch and
|
||
not in memory, which allows eland to access large datasets stored in Elasticsearch.
|
||
|
||
For example, to explore data in a large Elasticsearch index, simply create an eland DataFrame from an Elasticsearch
|
||
index pattern, and explore using an API that mirrors a subset of the pandas.DataFrame API:
|
||
|
||
```
|
||
>>> import eland as ed
|
||
|
||
>>> # Connect to 'flights' index via localhost Elasticsearch node
|
||
>>> df = ed.DataFrame('localhost:9200', 'flights')
|
||
|
||
>>> df.head()
|
||
AvgTicketPrice Cancelled ... dayOfWeek timestamp
|
||
0 841.265642 False ... 0 2018-01-01 00:00:00
|
||
1 882.982662 False ... 0 2018-01-01 18:27:00
|
||
2 190.636904 False ... 0 2018-01-01 17:11:14
|
||
3 181.694216 True ... 0 2018-01-01 10:33:28
|
||
4 730.041778 False ... 0 2018-01-01 05:13:00
|
||
|
||
[5 rows x 27 columns]
|
||
|
||
>>> df.describe()
|
||
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
|
||
count 13059.000000 13059.000000 ... 13059.000000 13059.000000
|
||
mean 628.253689 7092.142457 ... 511.127842 2.835975
|
||
std 266.386661 4578.263193 ... 334.741135 1.939365
|
||
min 100.020531 0.000000 ... 0.000000 0.000000
|
||
25% 410.008918 2470.545974 ... 251.739008 1.000000
|
||
50% 640.387285 7612.072403 ... 503.148975 3.000000
|
||
75% 842.262193 9735.660463 ... 720.505705 4.239865
|
||
max 1199.729004 19881.482422 ... 1902.901978 6.000000
|
||
|
||
[8 rows x 7 columns]
|
||
>>> df[['Carrier', 'AvgTicketPrice', 'Cancelled']]
|
||
Carrier AvgTicketPrice Cancelled
|
||
0 Kibana Airlines 841.265642 False
|
||
1 Logstash Airways 882.982662 False
|
||
2 Logstash Airways 190.636904 False
|
||
3 Kibana Airlines 181.694216 True
|
||
4 Kibana Airlines 730.041778 False
|
||
... ... ... ...
|
||
13054 Logstash Airways 1080.446279 False
|
||
13055 Logstash Airways 646.612941 False
|
||
13056 Logstash Airways 997.751876 False
|
||
13057 JetBeats 1102.814465 False
|
||
13058 JetBeats 858.144337 False
|
||
|
||
[13059 rows x 3 columns]
|
||
|
||
>>> df[(df.Carrier=="Kibana Airlines") & (df.AvgTicketPrice > 900.0) & (df.Cancelled == True)].head()
|
||
AvgTicketPrice Cancelled ... dayOfWeek timestamp
|
||
8 960.869736 True ... 0 2018-01-01 12:09:35
|
||
26 975.812632 True ... 0 2018-01-01 15:38:32
|
||
311 946.358410 True ... 0 2018-01-01 11:51:12
|
||
651 975.383864 True ... 2 2018-01-03 21:13:17
|
||
950 907.836523 True ... 2 2018-01-03 05:14:51
|
||
|
||
[5 rows x 27 columns]
|
||
|
||
>>> df[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
|
||
DistanceKilometers AvgTicketPrice
|
||
sum 9.261629e+07 8.204365e+06
|
||
min 0.000000e+00 1.000205e+02
|
||
std 4.578263e+03 2.663867e+02
|
||
|
||
>>> df[['Carrier', 'Origin', 'Dest']].nunique()
|
||
Carrier 4
|
||
Origin 156
|
||
Dest 156
|
||
dtype: int64
|
||
|
||
>>> s = df.AvgTicketPrice * 2 + df.DistanceKilometers - df.FlightDelayMin
|
||
>>> s
|
||
0 18174.857422
|
||
1 10589.365723
|
||
2 381.273804
|
||
3 739.126221
|
||
4 14818.327637
|
||
...
|
||
13054 10219.474121
|
||
13055 8381.823975
|
||
13056 12661.157104
|
||
13057 20819.488281
|
||
13058 18315.431274
|
||
Length: 13059, dtype: float64
|
||
>>> print(s.info_es())
|
||
index_pattern: flights
|
||
Index:
|
||
index_field: _id
|
||
is_source_field: False
|
||
Mappings:
|
||
capabilities:
|
||
es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name
|
||
NaN script_field_None False double None float64 True True True script_field_None
|
||
Operations:
|
||
tasks: []
|
||
size: None
|
||
sort_params: None
|
||
_source: ['script_field_None']
|
||
body: {'script_fields': {'script_field_None': {'script': {'source': "(((doc['AvgTicketPrice'].value * 2) + doc['DistanceKilometers'].value) - doc['FlightDelayMin'].value)"}}}}
|
||
post_processing: []
|
||
|
||
>>> pd_df = ed.eland_to_pandas(df)
|
||
>>> pd_df.head()
|
||
AvgTicketPrice Cancelled ... dayOfWeek timestamp
|
||
0 841.265642 False ... 0 2018-01-01 00:00:00
|
||
1 882.982662 False ... 0 2018-01-01 18:27:00
|
||
2 190.636904 False ... 0 2018-01-01 17:11:14
|
||
3 181.694216 True ... 0 2018-01-01 10:33:28
|
||
4 730.041778 False ... 0 2018-01-01 05:13:00
|
||
|
||
[5 rows x 27 columns]
|
||
```
|
||
|
||
See [docs](https://eland.readthedocs.io/en/latest) and [demo_notebook.ipynb](https://eland.readthedocs.io/en/latest/examples/demo_notebook.html) for more examples.
|
||
|
||
## Where to get it
|
||
|
||
Eland can be installed from [PyPI](https://pypi.org/project/eland) via pip:
|
||
|
||
```bash
|
||
$ python -m pip install eland
|
||
```
|
||
|
||
Eland can also be installed from [Conda Forge](https://anaconda.org/conda-forge/eland) with Conda:
|
||
|
||
```bash
|
||
$ conda install -c conda-forge eland
|
||
```
|
||
|
||
The [source code](https://github.com/elastic/eland) is currently available on GitHub.
|
||
|
||
## Versions and Compatibility
|
||
|
||
### Python Version Support
|
||
|
||
Officially Python 3.6 and above.
|
||
|
||
eland depends on pandas version 1.0.0+.
|
||
|
||
### Elasticsearch Versions
|
||
|
||
eland is versioned like the Elastic stack (eland 7.5.1 is compatible with Elasticsearch 7.x up to 7.5.1)
|
||
|
||
A major version of the client is compatible with the same major version of Elasticsearch.
|
||
|
||
No compatibility assurances are given between different major versions of the client and Elasticsearch.
|
||
Major differences likely exist between major versions of Elasticsearch,
|
||
particularly around request and response object formats, but also around API urls and behaviour.
|
||
|
||
## Connecting to Elasticsearch
|
||
|
||
eland uses the [Elasticsearch low level client](https://elasticsearch-py.readthedocs.io/) to connect to Elasticsearch.
|
||
This client supports a range of [connection options and authentication mechanisms]
|
||
(https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch).
|
||
|
||
### Basic Connection Options
|
||
|
||
```
|
||
>>> import eland as ed
|
||
|
||
>>> # Connect to flights index via localhost Elasticsearch node
|
||
>>> ed.DataFrame('localhost', 'flights')
|
||
|
||
>>> # Connect to flights index via localhost Elasticsearch node on port 9200
|
||
>>> ed.DataFrame('localhost:9200', 'flights')
|
||
|
||
>>> # Connect to flights index via localhost Elasticsearch node on port 9200 with <user>:<password> credentials
|
||
>>> ed.DataFrame('http://<user>:<password>@localhost:9200', 'flights')
|
||
|
||
>>> # Connect to flights index via ssl
|
||
>>> es = Elasticsearch(
|
||
'https://<user>:<password>@localhost:443',
|
||
use_ssl=True,
|
||
verify_certs=True,
|
||
ca_certs='/path/to/ca.crt'
|
||
)
|
||
>>> ed.DataFrame(es, 'flights')
|
||
|
||
>>> # Connect to flights index via ssl using Urllib3HttpConnection options
|
||
>>> es = Elasticsearch(
|
||
['localhost:443', 'other_host:443'],
|
||
use_ssl=True,
|
||
verify_certs=True,
|
||
ca_certs='/path/to/CA_certs',
|
||
client_cert='/path/to/clientcert.pem',
|
||
client_key='/path/to/clientkey.pem'
|
||
)
|
||
>>> ed.DataFrame(es, 'flights')
|
||
```
|
||
|
||
### Connecting to an Elasticsearch Cloud Cluster
|
||
|
||
```
|
||
>>> import eland as ed
|
||
>>> from elasticsearch import Elasticsearch
|
||
|
||
>>> es = Elasticsearch(cloud_id="<cloud_id>", http_auth=('<user>','<password>'))
|
||
|
||
>>> es.info()
|
||
{'name': 'instance-0000000000', 'cluster_name': 'bf900cfce5684a81bca0be0cce5913bc', 'cluster_uuid': 'xLPvrV3jQNeadA7oM4l1jA', 'version': {'number': '7.4.2', 'build_flavor': 'default', 'build_type': 'tar', 'build_hash': '2f90bbf7b93631e52bafb59b3b049cb44ec25e96', 'build_date': '2019-10-28T20:40:44.881551Z', 'build_snapshot': False, 'lucene_version': '8.2.0', 'minimum_wire_compatibility_version': '6.8.0', 'minimum_index_compatibility_version': '6.0.0-beta1'}, 'tagline': 'You Know, for Search'}
|
||
|
||
>>> df = ed.read_es(es, 'reviews')
|
||
```
|
||
|
||
## Why eland?
|
||
|
||
Naming is difficult, but as we had to call it something:
|
||
|
||
* eland: elastic and data
|
||
* eland: 'Elk/Moose' in Dutch (Alces alces)
|
||
* [Elandsgracht](https://goo.gl/maps/3hGBMqeGRcsBJfKx8): Amsterdam street near Elastic's Amsterdam office
|
||
|
||
[Pronunciation](https://commons.wikimedia.org/wiki/File:Nl-eland.ogg): /ˈeːlɑnt/
|
||
|