52 Commits

Author SHA1 Message Date
Seth Michael Larson
18fb4af731 Document DataFrame.groupby() and rename Field.index -> .column 2020-10-15 17:11:29 -05:00
P. Sai Vinay
abc5ca927b
Add support for DataFrame.groupby() with aggregations 2020-10-15 10:52:48 -05:00
Seth Michael Larson
adafeed667
Add es_dtypes property to DataFrame and Series 2020-10-13 12:14:09 -05:00
P. Sai Vinay
b7c6c26606
Change DataFrame.filter() to preserve the order of items 2020-10-13 10:58:09 -05:00
P. Sai Vinay
4d96ad39fd
Switch agg defaults to numeric_only=None 2020-09-22 10:32:27 -05:00
Seth Michael Larson
ceacf759c3
Add long Apache-2.0 license header to all files 2020-07-08 15:10:43 -05:00
Seth Michael Larson
6000ea73d0
Add [DataFrame, Series].filter() 2020-05-20 12:45:30 -05:00
Seth Michael Larson
1378544933
Normalize and prune top-level APIs 2020-05-18 14:55:41 -05:00
Daniel Mesejo-León
94dbb36081
Add .sample() method to DataFrame and Series 2020-05-04 12:07:21 -05:00
Seth Michael Larson
15a1977dcf
Add agg compatibility logic to Field class 2020-04-27 15:16:48 -05:00
Seth Michael Larson
7946eb4daa
Add enforcement of license headers 2020-04-25 16:26:58 -05:00
Seth Michael Larson
33b4976f9a
Add type hints to base modules 2020-04-24 12:39:13 -05:00
Seth Michael Larson
448770df78
Restrict public API, update license header 2020-04-14 07:31:23 -05:00
Daniel Mesejo-León
e8f307d2e0
Add NDFrame.median() aggregation 2020-04-13 08:48:39 -05:00
Daniel Mesejo-León
7a1c636e56
Add NDFrame.var() and .std() aggregations 2020-04-12 15:48:13 -05:00
Seth Michael Larson
064d43b9ef
Remove eland.Client, use Elasticsearch directly 2020-04-06 07:25:25 -05:00
Seth Michael Larson
7e5f0d3913 Add DataFrame.es_query() to query Elasticsearch directly 2020-04-02 13:06:22 -05:00
Seth Michael Larson
0c1d7222fe
Drop support for Python 3.5, add Black 2020-03-27 07:56:28 -05:00
stevedodson
c5f5d00bb0
Adding support for df['timestamp'].min() etc. (#122)
There is still a difference between pandas/eland in terms
of min/max etc. aggregations as pandas supports this
on strings.
2020-01-30 11:03:37 +00:00
stevedodson
2ca538c49d
Feature/show progress (#120)
* Adding show_progress debug option to eland_to_pandas
2020-01-29 12:59:48 +00:00
stevedodson
409cb043c8
Refactoring of plotting + fixes for multiple charts (#117)
* Refactoring of plotting + fixes for multiple charts

Updates to plotting in line with pandas 0.25.3
Enables plotting of multiple histograms on the
same figure.

* Fix to setup.py to allow submodules

+ reformat of code and better Series.hist docs
2020-01-29 07:07:56 +00:00
stevedodson
903fbf0341
Feature/mapping cache (#103)
* Adding python 3.5 compatibility.

Main issue is ordering of dictionaries.

* Updating notebooks with 3.7 results.

* Removing temporary code.

* Defaulting to OrderedDict for python 3.5 + lint all code

All code reformatted by PyCharm and inspection results analysed.

* Adding support for multiple arithmetic operations.

Added new 'arithmetics' file to manage this process.
More tests to be added + cleanup.

* Significant refactor to arithmetics and mappings.

Work in progress. Tests don't pass.

* Major refactor to Mappings.

Field name mappings were stored in different places
(Mappings, QueryCompiler, Operations) and needed to
be kept in sync.

With the addition of complex arithmetic operations
this became complex and difficult to maintain. Therefore,
all field naming is now in 'FieldMappings' which
replaces 'Mappings'.

Note this commit removes the cache for some of the
mapped values and so the code is SIGNIFICANTLY
slower on large indices.

The date_format previously added to
Mappings has also been removed, as it too added
unnecessary complexity.

* Adding OrderedDict for 3.5 compatibility

* Fixes to ordering issues with 3.5

* Adding simple cache for mappings in flatten

Improves performance significantly on large
datasets (>10000 rows).

* Adding updated notebooks (new info_es).

All tests (doc + nbval + pytest) pass.
2020-01-10 08:12:03 +00:00
stevedodson
efe21a6d87
Feature/arithmetic ops (#102)
* Adding python 3.5 compatibility.

Main issue is ordering of dictionaries.

* Updating notebooks with 3.7 results.

* Removing temporary code.

* Defaulting to OrderedDict for python 3.5 + lint all code

All code reformatted by PyCharm and inspection results analysed.

* Adding support for multiple arithmetic operations.

Added new 'arithmetics' file to manage this process.
More tests to be added + cleanup.

* Significant refactor to arithmetics and mappings.

Work in progress. Tests don't pass.

* Major refactor to Mappings.

Field name mappings were stored in different places
(Mappings, QueryCompiler, Operations) and needed to
be kept in sync.

With the addition of complex arithmetic operations
this became complex and difficult to maintain. Therefore,
all field naming is now in 'FieldMappings' which
replaces 'Mappings'.

Note this commit removes the cache for some of the
mapped values and so the code is SIGNIFICANTLY
slower on large indices.

The date_format previously added to
Mappings has also been removed, as it too added
unnecessary complexity.

* Adding OrderedDict for 3.5 compatibility

* Fixes to ordering issues with 3.5
2020-01-10 08:05:43 +00:00
stevedodson
c5730e6d38
Feature/python 3.5 (#93)
* Adding python 3.5 compatibility.

Main issue is ordering of dictionaries.

* Updating notebooks with 3.7 results.

* Removing temporary code.

* Defaulting to OrderedDict for python 3.5 + lint all code

All code reformatted by PyCharm and inspection results analysed.
2019-12-11 14:27:35 +01:00
stevedodson
133b227b93
Added example notebooks + pytest for notebooks (#87)
* Added example notebooks + pytest for these notebooks

* Fixed paths

* Fixing link in docs

* Adding cleaner demo_notebook
2019-12-10 15:27:13 +01:00
stevedodson
206276c5fa
Adding Apache 2 copyright header to all .py files (#86) 2019-12-06 09:44:05 +00:00
stevedodson
f06219f0ec
Feature/refactor tasks (#83)
* Significant refactor of task list in operations.py

Classes based on composite pattern replace tuples for
tasks.

* Addressing review comments for eland/operations.py

* Minor update to review fixes

* Minor fix for some better handling of non-aggregatable fields: https://github.com/elastic/eland/issues/71

* Test for non-aggregatable value_counts

* Refactoring tasks/actions

* Removing debug and fixing doctest
2019-12-06 08:46:43 +00:00
Francesco Vigliaturo
99bfea42b6
Added support for 2 date formats: (#70)
* Adds support for multiple date formats
2019-12-04 17:42:50 +01:00
Stephen Dodson
bf6c56878a Correcting license files + fixing bug in filter
LICENSE and NOTICE conform to Elastic policy. Bug in
nested negated filters fixed.

Also, some limited cleanup.
2019-12-03 13:56:49 +00:00
Michael Hirsch
a3dd86075a
String Arithmetics: __add__ ops (#68)
* adds support for __add__ ops for string objects and literals

* adds tests for string arithmetic

* updates comment in numeric field resolution

* adds op_type parameter for numeric_ops
2019-11-27 10:44:17 -05:00
Stephen Dodson
86686ebb18 Reformat and cleanup based on PyCharm 2019-11-26 11:02:46 +00:00
Stephen Dodson
ac8cb302de Updates based on PR review. 2019-11-25 12:43:37 +00:00
Stephen Dodson
84e23ab5d1 Added Series metric aggs + Series docs
Also, improved Series.to_string()
2019-11-22 15:44:55 +00:00
Stephen Dodson
5d119215f8 Fixing rename and truediv issues
tests pass
TODO - implement additional arithmetic ops
2019-11-21 20:37:54 +00:00
Stephen Dodson
c12bf9357b Series rename and arithmetic initial implementation
Partially implemented, tests fail with this commit.
2019-11-21 15:39:13 +00:00
Michael Hirsch
9c03d5a0d4 instantiates column as series with specified dtype 2019-11-19 13:13:08 -05:00
Michael Hirsch
9c9ca90c0d
Adds Support for Series.value_counts() (#49)
* adds support for series.value_counts

* adds docs for series.value_counts

* adds tests for series.value_counts

* updates keyerror language

* adds es docs as an external source

* adds parameters for metrics and terms aggs

* adds 2 tests to check for exceptions

* explains the size parameter

* removes print statements from tests

* checks that es_size is a positive integer

* implements assert_series_equal
2019-11-19 11:27:15 -05:00
Stephen Dodson
2f4d601932 Adding eland.read_csv
TODO - resolve issue with ordering of eland.DataFrame compared to csv
2019-11-15 15:14:12 +00:00
Stephen Dodson
f5025b9f39 Renamed ed_to_pd to eland_to_pandas and added docs.
+ added some additions to .gitignore
+ removed DataFrame.squeeze for now
2019-11-15 11:21:27 +00:00
Stephen Dodson
dff49d01fe More doc updates. 2019-11-13 18:23:43 +00:00
Stephen Dodson
8de7a1db7d Resolved minor PyCharm issues 2019-11-05 13:31:10 +00:00
Stephen Dodson
c1ee409a33 Major cleanup - removed modin as dependency
modin removed as a dependency and iloc feature
removed for now - TODO add back in.
2019-11-04 13:13:42 +00:00
Stephen Dodson
337bef1c5d Demo day notebook + minor updates added 2019-08-15 12:26:58 +00:00
Stephen Dodson
ef289bfe78 Adding partial DataFrame.query support
Only > and == currently implemented for PoC. 'query'
language not supported yet.
2019-08-14 14:44:04 +00:00
Stephen Dodson
49bad292d3 Added DataFrame.to_csv - tests still failing 2019-08-09 07:54:44 +00:00
Stephen Dodson
c6e0c5b92b Adding smaller test and first effort to implement aggs 2019-08-06 14:58:38 +00:00
Stephen Dodson
67b7aee9c9 Adding DataFrame.hist tests and DataFrame.select_dtypes 2019-08-01 12:55:17 +00:00
Stephen Dodson
3435ffac1b Adding first implementation of eland.DataFrame.hist 2019-07-31 09:59:52 +00:00
Stephen Dodson
1fa4d3fbe7 Partial implementation of hist - does not work
Backup push
2019-07-12 15:24:32 +00:00
Stephen Dodson
d71ce9f50c Adding drop + the ability for operations to have a query
Significant refactor - needs cleanup
2019-07-11 10:11:57 +00:00