Adding 'development' section to docs

Adding contributing section based on Elasticsearch/CONTRIBUTING.md
TODO - add testing docs (based on CI)1
This commit is contained in:
Stephen Dodson 2019-11-20 10:32:35 +00:00
parent 2a409962ea
commit 6564f26245
9 changed files with 194 additions and 65 deletions

6
.gitignore vendored
View File

@ -10,6 +10,12 @@ build/
# docs build folder
docs/build/
# pytest results
eland/tests/dataframe/results/
eland/tests/dataframe/results/
result_images/
# Python egg metadata, regenerated from source files by setuptools.
/*.egg-info

View File

@ -1,58 +0,0 @@
# Implementation Notes
The goal of an `eland.DataFrame` is to enable users who are familiar with `pandas.DataFrame`
to access, explore and manipulate data that resides in Elasticsearch.
Ideally, all data should reside in Elasticsearch and not to reside in memory.
This restricts the API, but allows access to huge data sets that do not fit into memory, and allows
use of powerful Elasticsearch features such as aggrergations.
## Implementation Details
### 3rd Party System Access
Generally, integrations with [3rd party storage systems](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)
(SQL, Google Big Query etc.) involve accessing these systems and reading all external data into an
in-core pandas data structure. This also applies to [Apache Arrow](https://arrow.apache.org/docs/python/pandas.html)
structures.
Whilst this provides access to data in these systems, for large datasets this can require significant
in-core memory, and for systems such as Elasticsearch, bulk export of data can be an inefficient way
of exploring the data.
An alternative option is to create an API that proxies `pandas.DataFrame`-like calls to Elasticsearch
queries and operations. This could allow the Elasticsearch cluster to perform operations such as
aggregations rather than exporting all the data and performing this operation in-core.
### Implementation Options
An option would be to replace the `pandas.DataFrame` backend in-core memory structures with Elasticsearch
accessors. This would allow full access to the `pandas.DataFrame` APIs. However, this has issues:
* If a `pandas.DataFrame` instance maps to an index, typical manipulation of a `pandas.DataFrame`
may involve creating many derived `pandas.DataFrame` instances. Constructing an index per
`pandas.DataFrame` may result in many Elasticsearch indexes and a significant load on Elasticsearch.
For example, `df_a = df['a']` should not require Elasticsearch indices `df` and `df_a`
* Not all `pandas.DataFrame` APIs map to things we may want to do in Elasticsearch. In particular,
API calls that involve exporting all data from Elasticsearch into memory e.g. `df.to_dict()`.
* The backend `pandas.DataFrame` structures are not easily abstractable and are deeply embedded in
the implementation.
Another option is to create a `eland.DataFrame` API that mimics appropriate aspects of
the `pandas.DataFrame` API. This resolves some of the issues above as:
* `df_a = df['a']` could be implemented as a change to the Elasticsearch query used, rather
than a new index
* Instead of supporting the enitre `pandas.DataFrame` API we can support a subset appropriate for
Elasticsearch. If addition calls are required, we could to create a `eland.DataFrame._to_pandas()`
method which would explicitly export all data to a `pandas.DataFrame`
* Creating a new `eland.DataFrame` API gives us full flexibility in terms of implementation. However,
it does create a large amount of work which may duplicate a lot of the `pandas` code - for example,
printing objects etc. - this creates maintenance issues etc.

View File

@ -25,8 +25,7 @@ sys.path.extend(
# -- Project information -----------------------------------------------------
project = 'eland'
copyright = '2019, Stephen Dodson'
author = 'Stephen Dodson'
copyright = '2019, Elasticsearch B.V.'
# The full version, including alpha/beta/rc tags
release = '0.1'
@ -95,4 +94,4 @@ html_theme = "pandas_sphinx_theme"
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
#html_static_path = ['_static']

View File

@ -0,0 +1,167 @@
=====================
Contributing to eland
=====================
Eland is an open source project and we love to receive contributions
from our community — you! There are many ways to contribute, from
writing tutorials or blog posts, improving the documentation, submitting
bug reports and feature requests or writing code which can be
incorporated into eland itself.
Bug reports
-----------
If you think you have found a bug in eland, first make sure that you are
testing against the `latest version of
eland <https://github.com/elastic/eland>`__ - your issue may already
have been fixed. If not, search our `issues
list <https://github.com/elastic/eland/issues>`__ on GitHub in case a
similar issue has already been opened.
It is very helpful if you can prepare a reproduction of the bug. In
other words, provide a small test case which we can run to confirm your
bug. It makes it easier to find the problem and to fix it. Test cases
should be provided as python scripts, ideally with some details of your
Elasticsearch environment and index mappings, and (where appropriate) a
pandas example.
Provide as much information as you can. You may think that the problem
lies with your query, when actually it depends on how your data is
indexed. The easier it is for us to recreate your problem, the faster it
is likely to be fixed.
Feature requests
----------------
If you find yourself wishing for a feature that doesn't exist in eland,
you are probably not alone. There are bound to be others out there with
similar needs. Many of the features that eland has today have been added
because our users saw the need. Open an issue on our `issues
list <https://github.com/elastic/eland/issues>`__ on GitHub which
describes the feature you would like to see, why you need it, and how it
should work.
Contributing code and documentation changes
-------------------------------------------
If you have a bugfix or new feature that you would like to contribute to
eland, please find or open an issue about it first. Talk about what you
would like to do. It may be that somebody is already working on it, or
that there are particular issues that you should know about before
implementing the change.
We enjoy working with contributors to get their code accepted. There are
many approaches to fixing a problem and it is important to find the best
approach before writing too much code.
Note that it is unlikely the project will merge refactors for the sake
of refactoring. These types of pull requests have a high cost to
maintainers in reviewing and testing with little to no tangible benefit.
This especially includes changes generated by tools. For example,
converting all generic interface instances to use the diamond operator.
The process for contributing to any of the `Elastic
repositories <https://github.com/elastic/>`__ is similar. Details for
individual projects can be found below.
Fork and clone the repository
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You will need to fork the main eland code or documentation repository
and clone it to your local machine. See `github help
page <https://help.github.com/articles/fork-a-repo>`__ for help.
Further instructions for specific projects are given below.
Submitting your changes
~~~~~~~~~~~~~~~~~~~~~~~
Once your changes and tests are ready to submit for review:
1. Test your changes
Run the test suite to make sure that nothing is broken (TODO add link
to testing doc).
2. Sign the Contributor License Agreement
Please make sure you have signed our `Contributor License
Agreement <https://www.elastic.co/contributor-agreement/>`__. We are
not asking you to assign copyright to us, but to give us the right to
distribute your code without restriction. We ask this of all
contributors in order to assure our users of the origin and
continuing existence of the code. You only need to sign the CLA once.
3. Rebase your changes
Update your local repository with the most recent code from the main
eland repository, and rebase your branch on top of the latest master
branch. We prefer your initial changes to be squashed into a single
commit. Later, if we ask you to make changes, add them as separate
commits. This makes them easier to review. As a final step before
merging we will either ask you to squash all commits yourself or
we'll do it for you.
4. Submit a pull request
Push your local changes to your forked copy of the repository and
`submit a pull
request <https://help.github.com/articles/using-pull-requests>`__. In
the pull request, choose a title which sums up the changes that you
have made, and in the body provide more details about what your
changes do. Also mention the number of the issue where discussion has
taken place, eg “Closes #123”.
Then sit back and wait. There will probably be discussion about the pull
request and, if any changes are needed, we would love to work with you
to get your pull request merged into eland.
Please adhere to the general guideline that you should never force push
to a publicly shared branch. Once you have opened your pull request, you
should consider your branch publicly shared. Instead of force pushing
you can just add incremental commits; this is generally easier on your
reviewers. If you need to pick up changes from master, you can merge
master into your branch. A reviewer might ask you to rebase a
long-running pull request in which case force pushing is okay for that
request. Note that squashing at the end of the review process should
also not be done, that can be done when the pull request is `integrated
via GitHub <https://github.com/blog/2141-squash-your-commits>`__.
Contributing to the eland codebase
----------------------------------
**Repository:** https://github.com/elastic/eland
We internally develop using the PyCharm IDE. For PyCharm, we are
currently using a minimum version of PyCharm 2019.2.4.
Configuring PyCharm And Running Tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(All commands should be run from module root)
- Create a new project via 'Check out from Version Control'->'Git'
on the "Welcome to PyCharm" page (or other)
- Enter the URL to your fork of eland
(e.g. ``git@github.com:stevedodson/eland.git``)
- Click 'Yes' for 'Checkout from Version Control'
- Configure PyCharm environment:
- In 'Preferences' configure a 'Project: eland'->'Project Interpreter'.
Generally, we recommend creating a virtual environment (TODO link to
installing for python version support).
- In 'Preferences' set 'Tools'->'Python Integrated Tools'->'Default
test runner' to ``pytest``
- In 'Preferences' set 'Tools'->'Python Integrated Tools'->'Docstring
format' to ``numpy``
- Install development requirements. Open terminal in virtual
environment and run ``pip install -r requirements-dev.txt``
- Setup Elasticsearch instance (assumes ``localhost:9200``), and run
``python -m eland.tests.setup_tests`` to setup test environment -
*note this modifies Elasticsearch indices*
- Run ``pytest --doctest-modules`` to validate install
Documentation
~~~~~~~~~~~~~
- Install documentation requirements. Open terminal in virtual
environment and run ``pip install -r requirements-dev.txt``

View File

@ -0,0 +1,10 @@
.. _development:
===========
Development
===========
.. toctree::
:maxdepth: 2
contributing.rst

View File

@ -9,7 +9,7 @@ to access, explore and manipulate data that resides in Elasticsearch.
Ideally, all data should reside in Elasticsearch and not to reside in memory.
This restricts the API, but allows access to huge data sets that do not fit into memory, and allows
use of powerful Elasticsearch features such as aggrergations.
use of powerful Elasticsearch features such as aggregations.
Pandas and 3rd Party Storage Systems

View File

@ -24,6 +24,7 @@ In general, the data resides in elasticsearch and not in memory, which allows el
reference/index
implementation/index
development/index
* :doc:`reference/index`
@ -38,3 +39,7 @@ In general, the data resides in elasticsearch and not in memory, which allows el
* :doc:`implementation/details`
* :doc:`implementation/dataframe_supported`
* :doc:`development/index`
* :doc:`development/contributing`

View File

@ -1,6 +1,6 @@
__title__ = 'eland'
__description__ = 'Python elasticsearch client to analyse, explore and manipulate data that resides in elasticsearch.'
__url__ = 'https://github.com/elastic/app-search-python'
__version__ = '0.1'
__url__ = 'https://github.com/elastic/eland'
__version__ = '0.1a1'
__maintainer__ = 'Elasticsearch B.V.'
__maintainer_email__ = 'steve.dodson@elastic.co'

View File

@ -23,7 +23,7 @@ setup(
maintainer_email=about['__maintainer_email__'],
license='Apache 2.0',
classifiers=[
'Development Status :: 4 - Beta',
'Development Status :: 3 - Alpha',
'Intended Audience :: Developers',
'License :: OSI Approved :: Apache Software License',
'Programming Language :: Python :: 3.7',