mirror of
https://github.com/elastic/eland.git
synced 2025-07-11 00:02:14 +08:00
Adding contributing section based on Elasticsearch/CONTRIBUTING.md TODO - add testing docs (based on CI)1
62 lines
3.2 KiB
ReStructuredText
62 lines
3.2 KiB
ReStructuredText
.. _implementation/details:
|
|
|
|
======================
|
|
Implementation Details
|
|
======================
|
|
|
|
The goal of an ``eland.DataFrame`` is to enable users who are familiar with ``pandas.DataFrame``
|
|
to access, explore and manipulate data that resides in Elasticsearch.
|
|
|
|
Ideally, all data should reside in Elasticsearch and not to reside in memory.
|
|
This restricts the API, but allows access to huge data sets that do not fit into memory, and allows
|
|
use of powerful Elasticsearch features such as aggregations.
|
|
|
|
|
|
Pandas and 3rd Party Storage Systems
|
|
------------------------------------
|
|
|
|
Generally, integrations with [3rd party storage systems](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)
|
|
(SQL, Google Big Query etc.) involve accessing these systems and reading all external data into an
|
|
in-core pandas data structure. This also applies to [Apache Arrow](https://arrow.apache.org/docs/python/pandas.html)
|
|
structures.
|
|
|
|
Whilst this provides access to data in these systems, for large datasets this can require significant
|
|
in-core memory, and for systems such as Elasticsearch, bulk export of data can be an inefficient way
|
|
of exploring the data.
|
|
|
|
An alternative option is to create an API that proxies ``pandas.DataFrame``-like calls to Elasticsearch
|
|
queries and operations. This could allow the Elasticsearch cluster to perform operations such as
|
|
aggregations rather than exporting all the data and performing this operation in-core.
|
|
|
|
Implementation Options
|
|
----------------------
|
|
|
|
An option would be to replace the ``pandas.DataFrame`` backend in-core memory structures with Elasticsearch
|
|
accessors. This would allow full access to the ``pandas.DataFrame`` APIs. However, this has issues:
|
|
|
|
* If a ``pandas.DataFrame`` instance maps to an index, typical manipulation of a ``pandas.DataFrame``
|
|
may involve creating many derived ``pandas.DataFrame`` instances. Constructing an index per
|
|
``pandas.DataFrame`` may result in many Elasticsearch indexes and a significant load on Elasticsearch.
|
|
For example, ``df_a = df['a']`` should not require Elasticsearch indices ``df`` and ``df_a``
|
|
|
|
* Not all ``pandas.DataFrame`` APIs map to things we may want to do in Elasticsearch. In particular,
|
|
API calls that involve exporting all data from Elasticsearch into memory e.g. ``df.to_dict()``.
|
|
|
|
* The backend ``pandas.DataFrame`` structures are not easily abstractable and are deeply embedded in
|
|
the implementation.
|
|
|
|
Another option is to create a ``eland.DataFrame`` API that mimics appropriate aspects of
|
|
the ``pandas.DataFrame`` API. This resolves some of the issues above as:
|
|
|
|
* ``df_a = df['a']`` could be implemented as a change to the Elasticsearch query used, rather
|
|
than a new index
|
|
|
|
* Instead of supporting the enitre ``pandas.DataFrame`` API we can support a subset appropriate for
|
|
Elasticsearch. If addition calls are required, we could to create a ``eland.DataFrame._to_pandas()``
|
|
method which would explicitly export all data to a ``pandas.DataFrame``
|
|
|
|
* Creating a new ``eland.DataFrame`` API gives us full flexibility in terms of implementation. However,
|
|
it does create a large amount of work which may duplicate a lot of the ``pandas`` code - for example,
|
|
printing objects etc. - this creates maintenance issues etc.
|
|
|