45 Commits

Author SHA1 Message Date
Seth Michael Larson
7946eb4daa
Add an enforce license headers 2020-04-25 16:26:58 -05:00
Seth Michael Larson
33b4976f9a
Add type hints to base modules 2020-04-24 12:39:13 -05:00
Daniel Mesejo-León
a779f04a6d
Add default dtype to empty pd.Series
Suppress pandas DeprecationWarning with default dtype on empty pd.Series
2020-04-19 08:51:10 -05:00
Stephen Dodson
1bc83d15e7
Change var/std aggs to use sample instead of population 2020-04-15 14:16:12 -05:00
Seth Michael Larson
448770df78
Restrict public API, update license header 2020-04-14 07:31:23 -05:00
Daniel Mesejo-León
e8f307d2e0
Add NDFrame.median() aggregation 2020-04-13 08:48:39 -05:00
Daniel Mesejo-León
7a1c636e56
Add NDFrame.var() and .std() aggregations 2020-04-12 15:48:13 -05:00
Seth Michael Larson
064d43b9ef
Remove eland.Client, use Elasticsearch directly 2020-04-06 07:25:25 -05:00
Seth Michael Larson
29af76101e
Fix unpacking of median aggregation 2020-04-03 07:56:09 -05:00
Daniel Mesejo-León
e27a508c59
Update supported Pandas to v1.0 2020-03-27 12:21:15 -05:00
Seth Michael Larson
0c1d7222fe
Drop support for Python 3.5, add Black 2020-03-27 07:56:28 -05:00
Stephen Dodson
43e4d03b39
Too long frame exception2 (#137)
* Updating test matrix for 7.6 + removing oss for now.

* Resolving 7.6.0 docs issues

* Updating ML docs

* Fixing too_long_frame_exception in scan/scroll
2020-02-28 12:49:59 +00:00
Stephen Dodson
a33ff45ebc
Too long frame exception fixes (#135)
* Updating test matrix for 7.6 + removing oss for now.

* Resolving 7.6.0 docs issues

* Updating ML docs

* Resolving too_long_frame_exception on large mappings

- Embedded _source parameters in bodt rather than url
- Fixed bug in DataFrame.info on empty DataFrame
- Removed warning from TestImportedMLModel

* Resolving too_long_frame_exception on large mappings

- Embedded _source parameters in bodt rather than url
- Fixed bug in DataFrame.info on empty DataFrame
- Removed warning from TestImportedMLModel
2020-02-26 12:50:14 +00:00
stevedodson
c5f5d00bb0
Adding support for df['timestamp'].min() etc. (#122)
There is still a difference between pandas/eland in terms
of min/max etc. aggregations as pandas supports this
on strings.
2020-01-30 11:03:37 +00:00
stevedodson
2ca538c49d
Feature/show progress (#120)
* Adding show_progress debug option to eland_to_pandas

* Adding show_progress debug option to eland_to_pandas
2020-01-29 12:59:48 +00:00
stevedodson
a3293168a1
Feature/filtered hist (#104)
* Adding python 3.5 compatibility.

Main issue is ordering of dictionaries.

* Updating notebooks with 3.7 results.

* Removing tempoorary code.

* Defaulting to OrderedDict for python 3.5 + lint all code

All code reformated by PyCharm and inspection results analysed.

* Adding support for multiple arithmetic operations.

Added new 'arithmetics' file to manage this process.
More tests to be added + cleanup.

* Signficant refactor to arithmetics and mappings.

Work in progress. Tests don't pass.

* Major refactor to Mappings.

Field name mappings were stored in different places
(Mappings, QueryCompiler, Operations) and needed to
be keep in sync.

With the addition of complex arithmetic operations
this became complex and difficult to maintain. Therefore,
all field naming is now in 'FieldMappings' which
replaces 'Mappings'.

Note this commit removes the cache for some of the
mapped values and so the code is SIGNIFICANTLY
slower on large indices.

In addition, the addition of date_format to
Mappings has been removed. This again added more
unncessary complexity.

* Adding OrderedDict for 3.5 compatibility

* Fixes to ordering issues with 3.5

* Adding simple cache for mappings in flatten

Improves performance significantly on large
datasets (>10000 rows).

* Adding updated notebooks (new info_es).

All tests (doc + nbval + pytest) pass.

* Fixing issue with non-zero offset histograms.
2020-01-10 08:17:45 +00:00
stevedodson
efe21a6d87
Feature/arithmetic ops (#102)
* Adding python 3.5 compatibility.

Main issue is ordering of dictionaries.

* Updating notebooks with 3.7 results.

* Removing tempoorary code.

* Defaulting to OrderedDict for python 3.5 + lint all code

All code reformated by PyCharm and inspection results analysed.

* Adding support for multiple arithmetic operations.

Added new 'arithmetics' file to manage this process.
More tests to be added + cleanup.

* Signficant refactor to arithmetics and mappings.

Work in progress. Tests don't pass.

* Major refactor to Mappings.

Field name mappings were stored in different places
(Mappings, QueryCompiler, Operations) and needed to
be keep in sync.

With the addition of complex arithmetic operations
this became complex and difficult to maintain. Therefore,
all field naming is now in 'FieldMappings' which
replaces 'Mappings'.

Note this commit removes the cache for some of the
mapped values and so the code is SIGNIFICANTLY
slower on large indices.

In addition, the addition of date_format to
Mappings has been removed. This again added more
unncessary complexity.

* Adding OrderedDict for 3.5 compatibility

* Fixes to ordering issues with 3.5
2020-01-10 08:05:43 +00:00
Michael Hirsch
79fdb1727e
Add Support for Series Histograms (#95)
* add support for series plotting
* update docs for series plotting support
* add tests for series plotting
* fix typo
* adds comment to ed_hist_series
2019-12-11 14:51:47 -05:00
stevedodson
c5730e6d38
Feature/python 3.5 (#93)
* Adding python 3.5 compatibility.

Main issue is ordering of dictionaries.

* Updating notebooks with 3.7 results.

* Removing tempoorary code.

* Defaulting to OrderedDict for python 3.5 + lint all code

All code reformated by PyCharm and inspection results analysed.
2019-12-11 14:27:35 +01:00
stevedodson
206276c5fa
Adding Apache 2 copyright header to all .py files (#86) 2019-12-06 09:44:05 +00:00
stevedodson
f06219f0ec
Feature/refactor tasks (#83)
* Significant refactor of task list in operations.py

Classes based on composite pattern replace tuples for
tasks.

* Addressing review comments for eland/operations.py

* Minor update to review fixes

* Minor fix for some better handling of non-aggregatable fields: https://github.com/elastic/eland/issues/71

* Test for non-aggrgatable value_counts

* Refactoring tasks/actions

* Removing debug and fixing doctest
2019-12-06 08:46:43 +00:00
Michael Hirsch
f263e21b8a Better Handling of Non Aggregatable Fields (#85)
* updates ecommerce mapping to include non-aggregatable text field

* updates exists tests and adds new tests for non-aggregatable field

* better handling on non-aggregatable fields

* fixes formatting

* swaps series in assertion

* adds newline
2019-12-06 08:20:09 +00:00
Stephen Dodson
bf6c56878a Correcting license files + fixing bug in filter
LICENSE and NOTICE conform to Elastic policy. Bug in
nested negated filters fixed.

Also, some limited cleanup.
2019-12-03 13:56:49 +00:00
Michael Hirsch
a3dd86075a
String Arithmetics: __add__ ops (#68)
* adds support for __add__ ops for string objects and literals

* adds tests for string arithmetic

* updates comment in numeric field resolution

* adds op_type parameter for numeric_ops
2019-11-27 10:44:17 -05:00
Stephen Dodson
86686ebb18 Reformat and cleanup based on PyCharm 2019-11-26 11:02:46 +00:00
Stephen Dodson
b99f25e4ee Adding __r* operations and resolving issues with df.info() 2019-11-25 15:00:02 +00:00
Stephen Dodson
ac8cb302de Updates based on PR review. 2019-11-25 12:43:37 +00:00
Stephen Dodson
84e23ab5d1 Added Series metric aggs + Series docs
Also, improved Series.to_string()
2019-11-22 15:44:55 +00:00
Stephen Dodson
5d119215f8 Fixing rename and truediv issues
tests pass
TODO - implement additional orithmetic ops
2019-11-21 20:37:54 +00:00
Stephen Dodson
c12bf9357b Series rename and arithmetic initial implementation
Partially implemented, tests fail with this commit.
2019-11-21 15:39:13 +00:00
Michael Hirsch
9c9ca90c0d
Adds Support for Series.value_counts() (#49)
* adds support for series.value_counts

* adds docs for series.value_counts

* adds tests for series.value_counts

* updates keyerror language

* adds es docs as an external source

* adds parameters for metrics and terms aggs

* adds 2 tests to check for exceptions

* explains the size parameter

* removes print statements from tests

* checks that es_size is a positive integer

* implements assert_series_equal
2019-11-19 11:27:15 -05:00
Stephen Dodson
f5025b9f39 Renamed ed_to_pd eland_to_pandas and added docs.
+ added some additions to .gitignore
+ removed DataFrame.squeeze for now
2019-11-15 11:21:27 +00:00
Stephen Dodson
5a546577f4 Resolving DataFrame.query issues + more docs 2019-11-14 20:04:38 +00:00
Stephen Dodson
c1ee409a33 Major cleanup - removed modin as dependency
modin removed as a dependency and iloc feature
removed for now - TODO add back in.
2019-11-04 13:13:42 +00:00
Stephen Dodson
337bef1c5d Demo day notebook + minor updates added 2019-08-15 12:26:58 +00:00
Stephen Dodson
ef289bfe78 Adding partial DataFrame.query support
Only > and == currently implemented for PoC. 'query'
language not supported yet.
2019-08-14 14:44:04 +00:00
Stephen Dodson
49bad292d3 Added DataFrame.to_csv - tests still failing 2019-08-09 07:54:44 +00:00
Stephen Dodson
c6e0c5b92b Adding smaller test and first effort to implement aggs 2019-08-06 14:58:38 +00:00
Stephen Dodson
3435ffac1b Adding first implementation of eland.DataFrame.hist 2019-07-31 09:59:52 +00:00
Stephen Dodson
1fa4d3fbe7 Partial implementation of hist - does not work
Backup push
2019-07-12 15:24:32 +00:00
Stephen Dodson
9bf3505b7e Cleanup + removed dependence of elasticsearch-dsl 2019-07-11 10:17:04 +00:00
Stephen Dodson
d71ce9f50c Adding drop + the ability for operations to have a query
Significant refactor - needs cleanup
2019-07-11 10:11:57 +00:00
Stephen Dodson
a73c999290 iloc is (mainly) working. 2019-07-09 10:02:08 +00:00
Stephen Dodson
d0ea715c31 Added test data and additional test cases 2019-07-04 19:25:47 +00:00
Stephen Dodson
15e0c37182 Major refactor. eland is now backed by modin.
First push, still not functional.
2019-07-04 13:00:19 +00:00