mirror of
https://github.com/elastic/eland.git
synced 2025-07-11 00:02:14 +08:00
* Adding Python 3.5 compatibility. The main issue is the ordering of dictionaries.
* Updating notebooks with 3.7 results.
* Removing temporary code.
* Defaulting to OrderedDict for Python 3.5 and linting all code. All code reformatted by PyCharm and inspection results analysed.
* Adding support for multiple arithmetic operations. Added new 'arithmetics' file to manage this process. More tests to be added, plus cleanup.
* Significant refactor to arithmetics and mappings. Work in progress. Tests don't pass.
* Major refactor to Mappings. Field name mappings were stored in different places (Mappings, QueryCompiler, Operations) and needed to be kept in sync. With the addition of complex arithmetic operations this became complex and difficult to maintain. Therefore, all field naming is now in 'FieldMappings', which replaces 'Mappings'. Note that this commit removes the cache for some of the mapped values, so the code is SIGNIFICANTLY slower on large indices. In addition, the addition of date_format to Mappings has been removed. This again added more unnecessary complexity.
* Adding OrderedDict for 3.5 compatibility.
* Fixes to ordering issues with 3.5.
What is it?
eland is an Elasticsearch client Python package to analyse, explore and manipulate data that resides in Elasticsearch. Where possible, the package uses existing Python APIs and data structures to make it easy to switch from numpy, pandas and scikit-learn to their Elasticsearch-powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows eland to access large datasets stored in Elasticsearch.
For example, to explore data in a large Elasticsearch index, simply create an eland DataFrame from an Elasticsearch index pattern, and explore it using an API that mirrors a subset of the pandas.DataFrame API:
>>> import eland as ed
>>> df = ed.read_es('http://localhost:9200', 'reviews')
>>> df.head()
   reviewerId  vendorId  rating              date
0           0         0       5  2006-04-07 17:08
1           1         1       5  2006-05-04 12:16
2           2         2       4  2006-04-21 12:26
3           3         3       5  2006-04-18 15:48
4           3         4       5  2006-04-18 15:49
>>> df.describe()
          reviewerId       vendorId         rating
count  578805.000000  578805.000000  578805.000000
mean   174124.098437      60.645267       4.679671
std    116951.972209      54.488053       0.800891
min         0.000000       0.000000       0.000000
25%     70043.000000      20.000000       5.000000
50%    161052.000000      44.000000       5.000000
75%    272697.000000      83.000000       5.000000
max    400140.000000     246.000000       5.000000
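To make the rows of the describe() output concrete, the following is a minimal sketch that computes the same kind of summary for a small in-memory list using only the Python standard library. It assumes pandas-style definitions (sample standard deviation, linearly interpolated quartiles); the function name describe_column is hypothetical, and this is not how eland computes its results, which may come from server-side aggregations and differ in detail.

```python
import statistics


def describe_column(values):
    """Return pandas-style summary statistics for a list of numbers.

    Illustration only: describe_column is a hypothetical helper,
    not part of eland or pandas.
    """
    data = sorted(values)
    # method="inclusive" interpolates linearly between the min and max,
    # treating them as the 0th and 100th percentiles.
    q25, q50, q75 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation (n - 1)
        "min": data[0],
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": data[-1],
    }


# The five ratings shown in the df.head() output above.
summary = describe_column([5, 5, 4, 5, 5])
for name, value in summary.items():
    print(f"{name:>5}  {value}")
```

The point of eland is that a summary like this is requested through the DataFrame API while the data stays in Elasticsearch, rather than being loaded into a Python list first.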
Connecting to Elasticsearch Cloud
>>> import eland as ed
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch(cloud_id="<cloud_id>", http_auth=('<user>','<password>'))
>>> es.info()
{'name': 'instance-0000000000', 'cluster_name': 'bf900cfce5684a81bca0be0cce5913bc', 'cluster_uuid': 'xLPvrV3jQNeadA7oM4l1jA', 'version': {'number': '7.4.2', 'build_flavor': 'default', 'build_type': 'tar', 'build_hash': '2f90bbf7b93631e52bafb59b3b049cb44ec25e96', 'build_date': '2019-10-28T20:40:44.881551Z', 'build_snapshot': False, 'lucene_version': '8.2.0', 'minimum_wire_compatibility_version': '6.8.0', 'minimum_index_compatibility_version': '6.0.0-beta1'}, 'tagline': 'You Know, for Search'}
>>> df = ed.read_es(es, 'reviews')
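In scripts it is usually preferable to keep the cloud ID and credentials out of the source code. The sketch below gathers them from environment variables and returns the same keyword arguments used in the session above (cloud_id and http_auth). The helper name cloud_client_kwargs and the variable names ES_CLOUD_ID, ES_USER and ES_PASSWORD are assumptions made for illustration; they are not defined by eland or the Elasticsearch client.

```python
import os


def cloud_client_kwargs(env=None):
    """Collect Elasticsearch Cloud connection settings from the environment.

    Hypothetical helper: the variable names ES_CLOUD_ID, ES_USER and
    ES_PASSWORD are illustrative, not an eland or elasticsearch convention.
    """
    env = os.environ if env is None else env
    names = ("ES_CLOUD_ID", "ES_USER", "ES_PASSWORD")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    # Same keyword arguments as in the interactive example above.
    return {
        "cloud_id": env["ES_CLOUD_ID"],
        "http_auth": (env["ES_USER"], env["ES_PASSWORD"]),
    }
```

With the variables set, the client could then be created with Elasticsearch(**cloud_client_kwargs()) and passed to ed.read_es(es, 'reviews') as before.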
Development Setup
- Create a virtual environment in Python. For example:
python3 -m venv env
- Activate the virtual environment:
source env/bin/activate
- Install dependencies from the requirements.txt file:
pip install -r requirements.txt
Why eland?
Naming is difficult, but as we had to call it something:
- eland = elastic and data
- eland = 'Elk/Moose' in Dutch (Alces alces)
- Elandsgracht = a street in Amsterdam, near Elastic's Amsterdam office, where hides from elk (among other animals) were historically worked
Description
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch