In this work, we detail the design and structure of a Synopses Data Engine (SDE) that combines the virtues of parallel processing and stream summarization to deliver interactive analytics at extreme scale. Our SDE is built on top of Apache Flink and implements a novel synopsis-as-a-service (SDEaaS) paradigm. With this paradigm, it achieves (i) concurrent, on-demand maintenance of thousands of synopses of various types for thousands of streams, (ii) reuse of synopses that are common across concurrent workflows, (iii) data summarization facilities even for cross-(Big Data) platform workflows, (iv) pluggability of new synopses on-the-fly, and (v) increased potential for workflow execution optimization. The proposed SDE-as-a-service provides interactive analytics at scale by enabling three types of scalability: (i) enhanced horizontal scalability, i.e., not only scaling out the computation to the processing units available in a computer cluster, but also reducing the processing load assigned to each unit by operating on carefully crafted data summaries, (ii) vertical scalability, i.e., scaling the computation to very high numbers of processed streams, and (iii) federated scalability, i.e., scaling across geo-distributed clusters and clouds by controlling the communication required to answer global queries.
Our SDEaaS design (i) is already incorporated in a commercial analytics platform [3], namely RapidMiner Studio, (ii) is available open-source [4] and (iii) has been put into production in various real-world applications [5][6].
The SDEaaS proof-of-concept implementation is based on Apache Kafka and Flink. The SDEaaS operates as shown in the figure below:
When a request for maintaining a new synopsis is issued, it reaches the ``RegisterRequest`` and ``RegisterSynopsis`` FlatMaps, which produce keys for the workers (i.e., VM resources) that will handle this synopsis. Each of the two FlatMaps uses these keys for a different purpose, as explained below.
``RegisterRequest`` uses the keys to direct queries to the responsible workers, while ``RegisterSynopsis`` uses the keys to update the synopses when new data arrives (blue-coloured path). In particular, when a new streaming data tuple is ingested, the ``HashData`` FlatMap looks up the keys held by ``RegisterSynopsis`` to determine which workers the tuple should be directed to so that the synopsis is updated. This update is performed by the ``add`` FlatMap in the blue-coloured path. The rest of the operators in the figure are used for merging partial synopsis results maintained across workers or even across geo-distributed computer clusters. Please refer to [1][2] for further details.
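The key-based flow described above can be illustrated with a small, single-process sketch. It is not the SDE code: the `Engine` class, its method names, the hashing scheme used to pick workers, and the use of a Count-Min sketch as the synopsis are all assumptions made for illustration only. The sketch shows how registering a synopsis yields worker keys, how an incoming tuple is routed to one of those workers and added to its partial synopsis, and how a query merges the partial synopses.

```python
import hashlib
from collections import defaultdict


def stable_hash(text):
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


class CountMin:
    """Tiny Count-Min sketch, standing in for any mergeable synopsis."""

    def __init__(self, depth=4, width=64):
        self.depth, self.width = depth, width
        self.rows = [[0] * width for _ in range(depth)]

    def _col(self, row, item):
        return stable_hash(f"{row}:{item}") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._col(r, item)] += count

    def estimate(self, item):
        # Count-Min never underestimates: take the minimum over the rows.
        return min(self.rows[r][self._col(r, item)] for r in range(self.depth))

    def merge(self, other):
        # Sketches with equal dimensions merge by cell-wise addition.
        merged = CountMin(self.depth, self.width)
        for r in range(self.depth):
            merged.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return merged


class Engine:
    """Hypothetical, in-memory stand-in for the operator pipeline."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.keys = {}                    # synopsis id -> worker keys
        self.streams = defaultdict(list)  # stream id -> synopsis ids
        self.state = defaultdict(dict)    # worker key -> {synopsis id: partial}

    def register(self, synopsis_id, stream_id, parallelism):
        # Role of the Register* FlatMaps: produce keys for responsible workers.
        start = stable_hash(synopsis_id)
        count = min(parallelism, self.num_workers)
        self.keys[synopsis_id] = [(start + i) % self.num_workers for i in range(count)]
        self.streams[stream_id].append(synopsis_id)

    def ingest(self, stream_id, value):
        # Role of HashData: look up the keys and pick the worker for this tuple.
        for sid in self.streams[stream_id]:
            keys = self.keys[sid]
            worker = keys[stable_hash(str(value)) % len(keys)]
            # Role of add: update the partial synopsis held by that worker.
            self.state[worker].setdefault(sid, CountMin()).add(value)

    def merged(self, synopsis_id):
        # Merge the partial synopses kept by the responsible workers.
        result = CountMin()
        for worker in self.keys[synopsis_id]:
            partial = self.state[worker].get(synopsis_id)
            if partial is not None:
                result = result.merge(partial)
        return result

    def query(self, synopsis_id, item):
        return self.merged(synopsis_id).estimate(item)


engine = Engine(num_workers=4)
engine.register("cm1", stream_id="s1", parallelism=3)
for value in ["a"] * 5 + ["b"] * 2:
    engine.ingest("s1", value)
print(engine.query("cm1", "a"))  # estimated frequency of "a", at least 5
```

In this sketch the merge step is a cell-wise sum, which works because Count-Min sketches with identical dimensions and hash functions are mergeable. The same pattern would apply to merging partial results across clusters, although how the actual SDE operators do this is described in [1][2] rather than here.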
The source code of SDEaaS is available here: https://github.com/akontaxakis/SDE along with an SDE client: https://github.com/akontaxakis/SDE/tree/master/a-client-for-the-synopsis-data-engine that facilitates issuing requests and integrating the SDE into broader, even cross-Big Data platform, workflows [6].
[1] Antonios Kontaxakis, Nikos Giatrakos, Dimitris Sacharidis, Antonios Deligiannakis: And Synopses for All: a Synopses Data Engine for Extreme Scale Analytics-as-a-Service. Inf. Syst. (under review)
[2] Antonios Kontaxakis, Nikos Giatrakos, Antonios Deligiannakis: A Synopses Data Engine for Interactive Extreme-Scale Analytics. CIKM 2020: 2085-2088
[3] RapidMiner Studio, Streaming Extension. https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_streaming
[4] https://github.com/akontaxakis/SDE
[5] Marios Vodas, Konstantina Bereta, Dimitris Kladis, Dimitris Zissis, Elias Alevizos, Emmanouil Ntoulias, Alexander Artikis, Antonios Deligiannakis, Antonios Kontaxakis, Nikos Giatrakos, David Arnu, Edwin Yaqub, Fabian Temme, Mate Torok, Ralf Klinkenberg: Online Distributed Maritime Event Detection & Forecasting over Big Vessel Tracking Data. IEEE BigData 2021: 2052-2057
[6] George Stamatakis, Antonis Kontaxakis, Alkis Simitsis, Nikos Giatrakos, Antonios Deligiannakis: SheerMP: Optimized Streaming Analytics-as-a-Service over Multi-site and Multi-platform Settings. EDBT 2022: 2:558-2:561