mirror of https://github.com/elastic/eland.git synced 2025-07-11 00:02:14 +08:00


* Adding python 3.5 compatibility.

Main issue is ordering of dictionaries.

* Updating notebooks with 3.7 results.

* Removing temporary code.

* Defaulting to OrderedDict for python 3.5 + lint all code

All code reformatted by PyCharm and inspection results analysed.

* Adding support for multiple arithmetic operations.

Added new 'arithmetics' file to manage this process.
More tests to be added + cleanup.

* Significant refactor to arithmetics and mappings.

Work in progress. Tests don't pass.

* Major refactor to Mappings.

Field name mappings were stored in different places
(Mappings, QueryCompiler, Operations) and needed to
be kept in sync.

With the addition of complex arithmetic operations
this became complex and difficult to maintain. Therefore,
all field naming is now in 'FieldMappings' which
replaces 'Mappings'.

Note this commit removes the cache for some of the
mapped values and so the code is SIGNIFICANTLY
slower on large indices.

In addition, the date_format that had been added to
Mappings has been removed, as it added more
unnecessary complexity.

* Adding OrderedDict for 3.5 compatibility

* Fixes to ordering issues with 3.5

2020-01-10 08:05:43 +00:00

.ci

Move to latest .ci script structure (#101)

2020-01-09 11:18:56 +01:00

docs

Feature/arithmetic ops (#102)

2020-01-10 08:05:43 +00:00

eland

Feature/arithmetic ops (#102)

2020-01-10 08:05:43 +00:00

.dockerignore

First attempt at suitable Dockerfile and config for building Docker image for eland

2019-11-25 19:15:37 +01:00

.gitignore

Reformat and cleanup based on PyCharm

2019-11-26 11:02:46 +00:00

CONTRIBUTING.md

Added example notebooks + pytest for notebooks (#87)

2019-12-10 15:27:13 +01:00

LICENSE.txt

Correcting license files + fixing bug in filter

2019-12-03 13:56:49 +00:00

make_docs.sh

Added example notebooks + pytest for notebooks (#87)

2019-12-10 15:27:13 +01:00

MANIFEST.in

Feature/refactor tasks (#83)

2019-12-06 08:46:43 +00:00

NOTICE.txt

Correcting license files + fixing bug in filter

2019-12-03 13:56:49 +00:00

README.md

Adds build status sticker to README and runs test on different Python versions (#84)

2019-12-11 15:41:34 +01:00

requirements-dev.txt

Feature/pandas.0.25.3 (#91)

2019-12-10 16:05:37 +01:00

requirements.txt

Feature/pandas.0.25.3 (#91)

2019-12-10 16:05:37 +01:00

run_build.sh

Making run_build.sh executable

2019-11-29 08:58:31 +01:00

setup.py

Feature/python 3.5 (#93)

2019-12-11 14:27:35 +01:00

README.md

What is it?

eland is an Elasticsearch client Python package to analyse, explore and manipulate data that resides in Elasticsearch. Where possible, the package uses existing Python APIs and data structures to make it easy to switch from numpy, pandas, or scikit-learn to their Elasticsearch-powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows eland to access large datasets stored in Elasticsearch.

For example, to explore data in a large Elasticsearch index, create an eland DataFrame from an Elasticsearch index pattern and explore it using an API that mirrors a subset of the pandas.DataFrame API:

>>> import eland as ed

>>> df = ed.read_es('http://localhost:9200', 'reviews') 

>>> df.head()
   reviewerId  vendorId  rating              date
0           0         0       5  2006-04-07 17:08
1           1         1       5  2006-05-04 12:16
2           2         2       4  2006-04-21 12:26
3           3         3       5  2006-04-18 15:48
4           3         4       5  2006-04-18 15:49

>>> df.describe()
          reviewerId       vendorId         rating
count  578805.000000  578805.000000  578805.000000
mean   174124.098437      60.645267       4.679671
std    116951.972209      54.488053       0.800891
min         0.000000       0.000000       0.000000
25%     70043.000000      20.000000       5.000000
50%    161052.000000      44.000000       5.000000
75%    272697.000000      83.000000       5.000000
max    400140.000000     246.000000       5.000000
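
Because the data stays in Elasticsearch, summary statistics like those above have to be computed by the cluster rather than in local memory. The following is a minimal sketch of one way such a request body could be assembled, using Elasticsearch's `extended_stats` and `percentiles` aggregations. The helper name `describe_aggs` and the aggregation key names are hypothetical, and this is not necessarily how eland builds its queries internally.

```python
import json


def describe_aggs(numeric_fields, percents=(25, 50, 75)):
    """Build a search body with aggregations that roughly correspond to
    the rows of a describe() summary (hypothetical helper, not eland's API)."""
    aggs = {}
    for field in numeric_fields:
        # extended_stats reports count, min, max, avg and std_deviation
        aggs[field + "_stats"] = {"extended_stats": {"field": field}}
        # percentiles reports approximate values for the requested percents
        aggs[field + "_percentiles"] = {
            "percentiles": {"field": field, "percents": list(percents)}
        }
    # size 0 asks for aggregation results only, with no documents returned
    return {"size": 0, "aggs": aggs}


body = describe_aggs(["reviewerId", "vendorId", "rating"])
print(json.dumps(body, indent=2))
```

A body like this could be passed to a search request against the index. Note that Elasticsearch percentiles are approximate and its standard deviation may be computed differently from pandas, so the values need not match a local computation exactly.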

Connecting to Elasticsearch Cloud

>>> import eland as ed
>>> from elasticsearch import Elasticsearch

>>> es = Elasticsearch(cloud_id="<cloud_id>", http_auth=('<user>','<password>'))

>>> es.info()
{'name': 'instance-0000000000', 'cluster_name': 'bf900cfce5684a81bca0be0cce5913bc', 'cluster_uuid': 'xLPvrV3jQNeadA7oM4l1jA', 'version': {'number': '7.4.2', 'build_flavor': 'default', 'build_type': 'tar', 'build_hash': '2f90bbf7b93631e52bafb59b3b049cb44ec25e96', 'build_date': '2019-10-28T20:40:44.881551Z', 'build_snapshot': False, 'lucene_version': '8.2.0', 'minimum_wire_compatibility_version': '6.8.0', 'minimum_index_compatibility_version': '6.0.0-beta1'}, 'tagline': 'You Know, for Search'}

>>> df = ed.read_es(es, 'reviews')

Development Setup

Create a virtual environment in Python

For example,

python3 -m venv env

Activate the virtual environment

source env/bin/activate

Install dependencies from the requirements.txt file

pip install -r requirements.txt
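
To check that the dependencies went into the virtual environment rather than the system interpreter, a short script can confirm that the environment is active. This is a sketch using only the standard library; the function name `in_virtualenv` is made up for illustration.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix points at the interpreter the venv was created from.
    base = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base


print("virtual environment active:", in_virtualenv())
```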

Why eland?

Naming is difficult, but as we had to call it something:

eland = elastic and data
eland = 'Elk/Moose' in Dutch (Alces alces)
Elandsgracht = a street in Amsterdam near Elastic's Amsterdam office where hides from elk, among other animals, were historically worked