mirror of
https://github.com/elastic/eland.git
synced 2025-07-11 00:02:14 +08:00
59 lines
3.0 KiB
Markdown
59 lines
3.0 KiB
Markdown
# Implementation Notes
|
|
|
|
The goal of an `eland.DataFrame` is to enable users who are familiar with `pandas.DataFrame`
|
|
to access, explore and manipulate data that resides in Elasticsearch.
|
|
|
|
Ideally, all data should reside in Elasticsearch and not to reside in memory.
|
|
This restricts the API, but allows access to huge data sets that do not fit into memory, and allows
|
|
use of powerful Elasticsearch features such as aggrergations.
|
|
|
|
## Implementation Details
|
|
|
|
### 3rd Party System Access
|
|
|
|
Generally, integrations with [3rd party storage systems](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)
|
|
(SQL, Google Big Query etc.) involve accessing these systems and reading all external data into an
|
|
in-core pandas data structure. This also applies to [Apache Arrow](https://arrow.apache.org/docs/python/pandas.html)
|
|
structures.
|
|
|
|
Whilst this provides access to data in these systems, for large datasets this can require significant
|
|
in-core memory, and for systems such as Elasticsearch, bulk export of data can be an inefficient way
|
|
of exploring the data.
|
|
|
|
An alternative option is to create an API that proxies `pandas.DataFrame`-like calls to Elasticsearch
|
|
queries and operations. This could allow the Elasticsearch cluster to perform operations such as
|
|
aggregations rather than exporting all the data and performing this operation in-core.
|
|
|
|
### Implementation Options
|
|
|
|
An option would be to replace the `pandas.DataFrame` backend in-core memory structures with Elasticsearch
|
|
accessors. This would allow full access to the `pandas.DataFrame` APIs. However, this has issues:
|
|
|
|
* If a `pandas.DataFrame` instance maps to an index, typical manipulation of a `pandas.DataFrame`
|
|
may involve creating many derived `pandas.DataFrame` instances. Constructing an index per
|
|
`pandas.DataFrame` may result in many Elasticsearch indexes and a significant load on Elasticsearch.
|
|
For example, `df_a = df['a']` should not require Elasticsearch indices `df` and `df_a`
|
|
|
|
* Not all `pandas.DataFrame` APIs map to things we may want to do in Elasticsearch. In particular,
|
|
API calls that involve exporting all data from Elasticsearch into memory e.g. `df.to_dict()`.
|
|
|
|
* The backend `pandas.DataFrame` structures are not easily abstractable and are deeply embedded in
|
|
the implementation.
|
|
|
|
Another option is to create a `eland.DataFrame` API that mimics appropriate aspects of
|
|
the `pandas.DataFrame` API. This resolves some of the issues above as:
|
|
|
|
* `df_a = df['a']` could be implemented as a change to the Elasticsearch query used, rather
|
|
than a new index
|
|
|
|
* Instead of supporting the enitre `pandas.DataFrame` API we can support a subset appropriate for
|
|
Elasticsearch. If addition calls are required, we could to create a `eland.DataFrame._to_pandas()`
|
|
method which would explicitly export all data to a `pandas.DataFrame`
|
|
|
|
* Creating a new `eland.DataFrame` API gives us full flexibility in terms of implementation. However,
|
|
it does create a large amount of work which may duplicate a lot of the `pandas` code - for example,
|
|
printing objects etc. - this creates maintenance issues etc.
|
|
|
|
|
|
|