Nikhil Das Nomula
GitHub Actions is a great tool for CI. However, we recently had a data engineering use case where the client had no cloud infrastructure in place, needed to move data from a source to a destination, and wanted that job to run on a schedule (cron). In this article we will go over why and how we ended up using GitHub Actions and its ability to run cron jobs to address this particular use case.
We had written a Python script to accomplish the data engineering task, and we had to decide what would be the best way for our client to automate it.
There were a few options to consider.
The reason we chose GitHub Actions is its simplicity for this use case.
If we had chosen the first two options, we would have had to set up GitHub connectivity and handle environment variables somewhere other than GitHub, where the script resides. Apache Airflow and Prefect are overkill for what we are trying to achieve.
From a cost perspective, GitHub Actions was free for this use case. There is a caveat: a scheduled run may be delayed or skipped when load on GitHub's runners is high, but that was not an issue for us here.
The best part is that when we transition this to the client, they have just one technology to manage instead of several.
Let's get into the how. The GitHub Actions workflow is pretty simple; this is what it looks like:
```yaml
name: run <path-to-your-python-file>.py

on:
  workflow_dispatch: # Manual trigger that comes in pretty handy for testing
  schedule:
    - cron: '*/10 * * * *' # Every 10 minutes

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: checkout repo content
        uses: actions/checkout@v3 # Check out the repository content to the GitHub runner
      - name: setup python
        uses: actions/setup-python@v4
        with:
          python-version: '3.12' # Set up Python; you can change the version here
      - name: Install pipenv
        run: |
          python -m pip install --upgrade pipenv wheel
      - name: Install dependencies
        run: |
          pip install python-dotenv==1.0.1 # The dependency you need to handle sensitive values; add any other dependencies your application needs here
      - name: Run the script
        env:
          ENV_VAR1: ${{ secrets.ENV_VAR1 }} # We create secrets in GitHub Actions and access them here, taking care of security
        run: |
          python src/<path-to-your-python-file>.py
```

As you can see, we set up the cron schedule and use the standard python-dotenv package to access secrets so that they are never stored in your code. This takes care of the major concerns and provides a simple solution. That said, this approach is not suited for every need; which option to choose depends on multiple factors.
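To illustrate the secrets-handling side, here is a minimal sketch of what the Python script's configuration code might look like. The variable name `ENV_VAR1` and the `get_config` helper are assumptions for illustration, not the actual client script: in GitHub Actions the value arrives as an environment variable populated from a repository secret, while python-dotenv lets the same code pick the value up from a local `.env` file during development.

```python
import os

# python-dotenv (pip install python-dotenv==1.0.1) loads a local .env file
# for development runs; in GitHub Actions the variables come from repository
# secrets, so the import is optional here.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

def get_config(name: str) -> str:
    """Read a required setting from the environment, failing fast if it is missing."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value

if __name__ == "__main__":
    # ENV_VAR1 matches the secret name referenced in the workflow file.
    connection_string = get_config("ENV_VAR1")
    print("configuration loaded, starting the data move...")
```

Failing fast on a missing variable means a misconfigured schedule run shows up as a clear error in the Actions log rather than a silent partial data move.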
If you have any questions, feel free to reach out to us at nikhil.nomula@yajur.tech