Nikhil Das Nomula
In this post we will go over three approaches to data engineering that we see across organizations: the data warehouse, the data lake, and the data mesh.
The data warehouse has been around for a while. The concept behind it is centralizing all of an organization's data under a global schema. Traditionally, this was done by moving data via ETL jobs from multiple transactional source databases into a centralized analytics database, where data analysts and data scientists can then use it.
The data lake takes after the data warehouse in that data is centralized, but the difference is that the data does not necessarily have to be structured. The idea gained momentum once cloud computing became the norm and data was moved into these data lakes (typically object storage such as S3 on AWS, Google Cloud Storage on GCP, and Azure Blob Storage on Azure). Organizations quickly realized, however, that some structure is still needed, and this gave rise to ELT (Extract, Load, Transform) in place of ETL (Extract, Transform, Load): raw data is loaded first and transformed afterwards in the target system.
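To make the difference between ETL and ELT more concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for the analytics store. The table names (`orders_etl`, `raw_orders`, `orders_elt`) and the sample rows are made up for illustration; a real pipeline would read from actual source databases and write to a warehouse or lake.

```python
import sqlite3

# Hypothetical source rows, standing in for a transactional database.
SOURCE_ORDERS = [
    {"id": 1, "customer": " Alice ", "amount_cents": 1250},
    {"id": 2, "customer": "BOB", "amount_cents": 300},
]


def etl(conn, rows):
    """ETL: transform in application code first, then load the cleaned rows."""
    conn.execute("CREATE TABLE orders_etl (id INTEGER, customer TEXT, amount REAL)")
    cleaned = [
        (r["id"], r["customer"].strip().lower(), r["amount_cents"] / 100)
        for r in rows
    ]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def elt(conn, rows):
    """ELT: load the raw rows as-is, then transform inside the target with SQL."""
    conn.execute(
        "CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:id, :customer, :amount_cents)", rows
    )
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT id,
               lower(trim(customer)) AS customer,
               amount_cents / 100.0 AS amount
        FROM raw_orders
        """
    )


conn = sqlite3.connect(":memory:")
etl(conn, SOURCE_ORDERS)
elt(conn, SOURCE_ORDERS)

etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY id").fetchall()
print(etl_rows)
print(elt_rows)
```

Both paths end with the same cleaned table. The difference is where the transformation runs: in the pipeline code before loading (ETL), or inside the target store after the raw data has landed (ELT), which also keeps the untransformed `raw_orders` around for later reprocessing.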
In both the data warehouse and the data lake architectures, data is centralized and a central data team is responsible for maintaining it.
Data mesh takes a contrasting approach: it proposes decentralizing the ownership of data and letting teams take ownership of their own data (called data products). We have written an article, datamesh-what-is-it, that goes into detail.
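As a rough illustration of decentralized ownership, the sketch below models data products that each name an owning team, plus a small shared catalog so other teams can discover them. The `DataProduct` and `Catalog` classes, the team names, and the column types are all hypothetical; this is not taken from any particular data mesh tooling.

```python
from dataclasses import dataclass


@dataclass
class DataProduct:
    """A dataset published and maintained by the team that produces it."""
    name: str
    owner_team: str          # the domain team accountable for this data
    columns: dict[str, str]  # column name -> type, acting as a simple contract


class Catalog:
    """Shared registry: discovery is central, ownership stays with the teams."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        if product.name in self._products:
            raise ValueError(f"data product {product.name!r} already registered")
        self._products[product.name] = product

    def owner_of(self, name):
        return self._products[name].owner_team

    def owned_by(self, team):
        return sorted(p.name for p in self._products.values() if p.owner_team == team)


catalog = Catalog()
catalog.register(
    DataProduct("orders", "checkout-team", {"id": "int", "amount": "float"})
)
catalog.register(
    DataProduct("shipments", "logistics-team", {"id": "int", "status": "str"})
)

print(catalog.owner_of("orders"))
print(catalog.owned_by("logistics-team"))
```

The point of the sketch is the shape, not the implementation: there is no central team that maintains `orders` or `shipments`; each entry points back to the domain team responsible for it.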
Which approach is right for your organization depends on its size. If you want to learn more about this, feel free to schedule some time here: https://yajur.youcanbook.me/