Data is a powerful tool that can be used to improve many aspects of our lives, including our health. With the proliferation of wearable fitness trackers, health apps, and other monitoring devices, it has become easier than ever to collect and analyze data about our health. By tracking and analyzing this data, we can gain valuable insights into our health and wellness, enabling us to make more informed decisions about our lifestyles and habits.
Health devices, which let you track all of your health metrics in one place, make it easy to monitor your progress and make informed decisions about your health. This blog shows how you can use your device and its data to deliver even more actionable insights. The example I'll walk through uses Apple HealthKit to perform advanced analytics and machine learning, and to build a dashboard with relevant KPIs and metrics. The goal is to help track my weekly and monthly performance across these metrics so I can monitor and achieve my health goals. Inspired by the blog post "You Are What You Measure," my intent is to measure my way to good health! You can join in the fun too (GitHub repo here).
The Foundation of Our Data-driven Fitness Journey
With the explosion of data volumes at our fingertips, and the myriad of tools to acquire, transform, analyze, and visualize that data, it's easy to be overwhelmed. The lakehouse architecture simplifies data use cases by providing all of the necessary capabilities under one platform. In addition to unifying workflows and data teams, the Databricks Lakehouse Platform, powered by Delta Lake, makes data warehouse-level features (like ACID transactions, governance, and performance) available with data lake scale, flexibility, and cost.
To power our dashboard and analytics, we'll be leveraging Apple HealthKit, which in addition to great tracking offers data-sharing capabilities from third-party apps in the iOS ecosystem. To take it a step further, we'll also use the full extent of the lakehouse: a combination of Apache Spark, Databricks SQL, and MLflow to extract further insights, aggregations, and KPI tracking to keep me honest throughout 2023. We'll walk through how to use Delta Live Tables to orchestrate a streaming ETL process, use a metadata-driven ETL framework for data transformation, and expose a dashboard with relevant KPIs to drive data-driven actions!
In the next sections of this blog post, we'll show how to:
- Export your health data
- Use Delta Live Tables to orchestrate streaming ETL
- Use a metadata-driven ETL framework to classify and transform our data
- Expose a dashboard with relevant KPIs
- Take data-driven actions!
The first step is to make the data available. There are several options for exporting Apple HealthKit data, including building your own integration with the accompanying APIs or using third-party apps. The approach we'll take is documented on the official HealthKit website: exporting directly from the app. Follow these simple instructions to export your data:
- Ensure you have relevant data in the Health app (such as steps and heart rate)
- Export your health data and upload it to your cloud storage of choice (I use Google Drive)
- Verify that the export.zip file is accessible in Google Drive
As shown in Figure 1, our data will make several stops along the way to visualization. Once the data is available on object storage, we'll process it through the medallion framework: ingesting raw XML (bronze), breaking out the disparate datasets (silver), and aggregating relevant KPIs on a minute, hourly, and daily basis to present to the serving tier (gold).
Data Verification and Sharing
To make sure the data is accessible, log into your target Google Drive account (or wherever your export was uploaded) and find the file named export.zip. Once located, ensure the file permissions are set to "anyone with the link," and copy the link for later use.
Data Acquisition and Preparation
Now that our data is accessible, it's time to set up our notebook for data access, governance, and ingestion into Databricks. The first step is to install and import the necessary libraries and set up the variables we'll reuse to automate extraction and transformation in later steps.
For data acquisition, we'll use gDown, a neat little library that makes downloading files from Google Drive simple and efficient. At this point, all you need to do is supply your shared link from Google Drive and a destination folder, then expand the .zip archive to the /tmp directory.
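A minimal sketch of this acquisition step, assuming a standard export.zip layout; the FILE_ID, paths, and helper name are placeholders, not values from the repo:

```python
import zipfile
from pathlib import Path

def extract_export(zip_path: str, dest: str = "/tmp/apple_health") -> list[str]:
    """Expand the HealthKit export archive and return the names of its members."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return zf.namelist()

# The download itself uses gdown (pip install gdown); FILE_ID is a placeholder
# for the ID embedded in your "anyone with the link" share URL:
#   import gdown
#   gdown.download("https://drive.google.com/uc?id=FILE_ID", "/tmp/export.zip", quiet=False)
#   files = extract_export("/tmp/export.zip")
```

The extraction helper is plain standard-library Python, so the same code runs on a Databricks driver or locally.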
When exploring the contents of export.zip, there are several interesting datasets available, including workout routes (in .gpx format), electrocardiograms (in .csv), and HealthKit records (in .xml). For our purposes, we're targeting export.xml, which tracks most of the health metrics captured by Apple HealthKit. Please note: if export.xml contains millions of records (like mine does), you may need to increase the size of the Apache Spark driver for processing. Please refer to GitHub for reference.
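To make the record structure concrete, here is a small standard-library sketch of what the pipeline's XML reader extracts. The attribute names follow Apple's export format as I understand it (each `Record` element carries `type`, `value`, dates, and a unit); the sample payload is invented for illustration:

```python
import xml.etree.ElementTree as ET

def parse_records(xml_text: str) -> list[dict]:
    """Pull every <Record> element's attributes (type, value, dates, ...) into a dict."""
    root = ET.fromstring(xml_text)
    return [dict(rec.attrib) for rec in root.iter("Record")]

sample = """
<HealthData>
  <Record type="HKQuantityTypeIdentifierHeartRate" value="72"
          startDate="2023-01-01 07:00:00 -0500" unit="count/min"/>
  <Record type="HKQuantityTypeIdentifierStepCount" value="350"
          startDate="2023-01-01 07:00:00 -0500" unit="count"/>
</HealthData>
"""
records = parse_records(sample)
```

At multi-million-record scale you would use a distributed XML reader on Spark rather than parsing on a single machine, which is why the driver sizing note above matters.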
Before proceeding, we'll do a quick review of the DataFrame using Bamboolib, which provides a low-code/no-code approach to exploring the data and applying transformations, changing datatypes, or performing aggregations with minimal code. This gives us great insight into our data and alerts us to any possible data quality problems. Check out the EDA notebook.
As seen in Figure 3, export.xml consists of more than 3.8M records across 55 types of data. Through this exploration, we see it is a relatively clean dataset, with minimal null values in the value column, which is important because it stores the metric for each record's type. The type column spans the different available metrics, from sleep tracking to heartbeats per minute.
Now that we understand the data's shape and relationships, we can move on to the fun stuff!
Go for the GOLD: the Medallion Framework!
Upon further inspection, the XML is not so simple after all. As revealed by our Bamboolib analysis, although it is contained in a single XML file, it actually holds 55 different tracked metrics. Enter the lakehouse! We'll apply the medallion framework on the lakehouse to curate our data lake and process our data for downstream consumption by data science and BI teams, all on cheap object storage!
Processing and landing this data in the Delta format lets us keep the raw data while starting to reap the benefits.
As part of the ELT process, we'll take advantage of Delta Live Tables (DLT) to automate and simplify our data processing. DLT provides a declarative framework that automates work engineering teams would otherwise build by hand, including streaming pipeline tasks such as checkpointing and auto-scaling (with enhanced autoscaling), data validation with expectations, and pipeline observability metrics.
Data Exploration and Manipulation
We'll base our next analysis on the 'type' column, which defines the data in the payload. For example, the HeartRate type has different metrics tracked than the ActiveEnergyBurned type. As seen below, our XML file actually contains 55+ metrics (at least in my case; your mileage may vary) tracked by Apple HealthKit.
Each of these sources represents a unique metric tracked by Apple HealthKit, covering a different aspect of overall health. For example, it can pull anything from environmental decibel levels and the lat/lon of workouts to heart rate and calories burned. In addition to unique measurements, they come at different timescales, from per-workout and per-day down to per-second measurements; a rich dataset indeed!
We want to make sure we track any data quality problems throughout the pipeline. From our earlier investigation, it looked like some values might have been measured incorrectly, which could skew our data (a heart rate above 200??). Luckily, DLT makes data quality issues manageable using expectations and pipeline metrics. A DLT expectation is a condition you expect each record to satisfy (for example, IS NOT NULL OR x > 5) that DLT will act on. The action can be monitoring only (dlt.expect), dropping the record (dlt.expect_or_drop), or failing the table/pipeline (dlt.expect_or_fail). For more information on DLT and expectations, please refer to the documentation (link).
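As a plain-Python sketch, the kind of plausibility rule we hand to DLT looks like this. The numeric bounds here are my own illustrative assumptions (not values from the pipeline), and the decorator forms in the comments show how the same rule is declared in DLT:

```python
def plausible_heart_rate(value) -> bool:
    """Illustrative quality rule: non-null and within a believable human range."""
    return value is not None and 20 <= float(value) <= 220

# In a DLT pipeline, the same rule is declared as an expectation on a table,
# with the action chosen by the decorator:
#   @dlt.expect("plausible_hr", "value IS NOT NULL AND value BETWEEN 20 AND 220")  # monitor only
#   @dlt.expect_or_drop("plausible_hr", "value BETWEEN 20 AND 220")                # drop violations
#   @dlt.expect_or_fail("plausible_hr", "value BETWEEN 20 AND 220")                # stop the pipeline
```

The monitor-only form is handy early on: violations still land in the table, but the pipeline's event log counts them so you can decide whether to tighten the rule.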
As mentioned above, each type provides unique insight into your overall health. With over 55 different types and metrics, managing them can become a daunting task, especially as new metrics and data sources pop into the pipeline. For that reason, we'll leverage a metadata-driven framework to simplify and modularize our pipeline.
For our next step, we send over 10 unique sources to individual silver tables, with specific columns and the necessary transformations to make the data ML- and BI-ready. We use the metadata-driven framework, which simplifies and accelerates development across different data sources, to select specific values from our bronze table and transform them into individual silver tables. These transforms are based on the type column in our bronze table. In this example, we'll extract a subset of data sources, but the metadata table is easily extended to include additional metrics/tables as new data sources arise. The metadata table is represented below and contains the columns that drive our DLT framework, including source_name, table_name, columns, expectations, and comments.
We incorporate our metadata table into our DLT pipeline by leveraging the looping capabilities available in the Python API. Mixing capabilities between SQL and Python makes DLT an extremely powerful framework for development and transformation. We'll read in the Delta table (though it could be anything Spark can read into a DataFrame; see the example metadata.json in the repo) and loop through it to extract the metadata variables for our silver tables using the table_iterator function.
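Here is a framework-agnostic sketch of that looping pattern. The metadata rows, column names, and helper names are illustrative; the real pipeline reads the specs from a Delta table and wraps each one in a generated @dlt.table function operating on Spark DataFrames rather than Python lists:

```python
# Illustrative metadata: one row per downstream silver table.
metadata = [
    {"source_name": "HKQuantityTypeIdentifierHeartRate",
     "table_name": "heart_rate", "columns": ["startDate", "value", "unit"]},
    {"source_name": "HKQuantityTypeIdentifierStepCount",
     "table_name": "step_count", "columns": ["startDate", "value"]},
]

def make_silver(bronze_rows: list[dict], spec: dict) -> list[dict]:
    """Filter bronze records to one source type and keep only the configured columns."""
    return [{c: row.get(c) for c in spec["columns"]}
            for row in bronze_rows if row.get("type") == spec["source_name"]]

def build_silver_tables(bronze_rows: list[dict]) -> dict:
    """Loop over the metadata, producing one 'silver table' per spec."""
    return {spec["table_name"]: make_silver(bronze_rows, spec) for spec in metadata}
```

Adding a new silver table then means adding a metadata row, not writing new pipeline code, which is the whole appeal of the approach.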
This simple looping pattern, paired with a metadata table, reads data from bronze and produces the extractions and unique columns for over 10 downstream silver tables. The process is further outlined in the "DLT_bronze2silver" notebook, which contains the data ingestion (Auto Loader) and the metadata-driven transformations for the silver tables. Below is an example of the DLT DAG created from the different sources available in Apple HealthKit.
And clean datasets!
Finally, we combine several interesting datasets, in this case heart rate and workout records. We then create ancillary metrics (like heart rate zones) and perform by-minute and by-day aggregations to make downstream analytics more performant and appropriate for consumption. This is further outlined in the 'DLT_AppleHealth_iterator_Gold' notebook in the repo.
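As a sketch of those gold-layer metrics: the zone boundaries and the max-heart-rate default below are my own illustrative assumptions (the notebook may use different cutoffs), and in the pipeline both steps are Spark aggregations rather than Python loops:

```python
from collections import defaultdict
from statistics import mean

def heart_rate_zone(bpm: float, max_hr: float = 190.0) -> str:
    """Bucket a heart-rate reading into a training zone by percent of max heart rate."""
    pct = bpm / max_hr
    if pct < 0.60:
        return "zone1"
    if pct < 0.70:
        return "zone2"
    if pct < 0.80:
        return "zone3"
    if pct < 0.90:
        return "zone4"
    return "zone5"

def per_minute_avg(readings: list[tuple[str, float]]) -> dict:
    """Average readings keyed by their minute timestamp ('YYYY-MM-DD HH:MM')."""
    buckets = defaultdict(list)
    for minute, bpm in readings:
        buckets[minute].append(bpm)
    return {m: mean(v) for m, v in buckets.items()}
```

Pre-aggregating to the minute and day grain in gold is what keeps the downstream dashboard queries fast: they scan small rollups instead of millions of raw readings.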
With our data available and cleaned up, we're able to build dashboards to help track and visualize our journey. In this case, I built a simple dashboard using the capabilities included in Databricks SQL to track KPIs that will help me achieve my goals, including workout time, heart rate variability, workout efforts, overall averages, and short- and long-term trends. Of course, if you're more proficient with other data visualization tools (like Power BI or Tableau), Databricks can be fully integrated into your existing workflow.
Below is a simple dashboard with relevant metrics, split by 7-day and 30-day averages. I like getting a view across KPIs and time all in a single dashboard. This will help guide my workout program to ensure I continuously improve!
With such a rich dataset, you can also start delving into ML to analyze all measures of health. Since I'm not a data scientist, I leveraged the built-in AutoML capability to forecast my weight loss based on some of the gold and silver tables! AutoML provides an easy and intuitive way to train models, automate hyperparameter tuning, and integrate with MLflow for experiment tracking and model serving.
Hopefully this experiment provided a digestible introduction to Databricks and some of the great functionality available on the platform.
Now it's your turn to leverage the power of data to change behavior in a positive way! Get started with your own experiment.