Setting up a Data Warehouse on the Google Cloud Platform - Converge Consulting

Setting up a Data Warehouse on the Google Cloud Platform

As businesses and individuals are moving towards storing their data and information “in the cloud”, so too are higher education institutions and the agencies that service them. At Converge we have successfully built a data warehouse on the Google Cloud Platform. Let’s talk through why you might need a data warehouse, if it is right for higher education, and the architecture of our Converge data warehouse.
 

Why do we need a Data Warehouse?

The concept of a data warehouse is to create a permanent storage space for the data needed to support reporting, analysis, and other business intelligence (BI) functions. While it may seem wasteful to store data in multiple places (source systems and the data warehouse,) the many advantages of doing this more than justify the effort and expense. The overarching goal is to be able to connect and integrate all of the data from various sources for a client and create a comprehensive dashboard while pulling advanced analytics.
 

Is it Right for Higher Education?

Our clients consistently ask for additional insight into the student journey. For example, which students enrolled after interacting with a campaign or how prospective students are interacting with client web pages. Some of these questions can only be answered if we combine and report on data from various sources – campaign data to CRM data (where applicant data is recorded). To be able to make any significant analysis, we need to gather and store historical data; hence, the need for a data warehouse.
 

Google Cloud

So why did Converge choose the Google Cloud Platform?

  • Ease of implementation
  • Less costly due to no physical hardware
  • Ease of scaling up/down
  • Pay per use

Also, Google Cloud comes with a low cost trial. Initially, this helps you play around with some of the cloud applications before committing to a contract. The Google Cloud console provides an easy-to-use interface, where you can view various applications available on the cloud to launch and use.
 


 

BigQuery

BigQuery is the built-in data warehouse set up on the Google Cloud Platform. BigQuery allows users to upload data files directly from a local device, cloud storage, Google drive, or even put live streaming data to BigQuery tables.
 


 

These files are usually uploaded in JSON or CSV format. If a file is directly uploaded as CSV, the system automatically detects the data type of each field. At the same time, users can edit the schema if they do not want the system generated schema. Once the table is created, only a limited amount of changes can be done to edit the model. Also, the interface allows users to preview and run queries on the tables where query results can be exported into a CSV or Google Sheet. Users can directly pull tables or queries into Google Data Studio to build different reports and dashboards.
 


 

BigQuery enforces us to maintain the tables in the structure as Project -> Datasets -> Tables. We keep separate datasets for CRM, campaign, Google Analytics data. With each dataset we keep a different table for each entity such as applicants and inquiries in a CRM dataset.
 

Converge Data Warehouse Architecture

 


 

In our current architecture, we are receiving data from the client’s CRM, Google Analytics, and campaign data from various platforms such as Facebook, LinkedIn, and Google Ads. The client is exporting their CRM data daily in CSV format and keeping them on their file server. We created a custom script to connect to the client’s FTP (file transfer platform) for pulling these files into our cloud storage. The client’s campaign data is pulled into Google Sheets, utilizing a third party tool that is integrated with various advertising platforms. Using a custom shell script with gsutil & unix commands we pull the data from Google Sheets to the cloud storage. Both the shell scripts are deployed on virtual machine server, in other words, set up on Google Cloud. These scripts are scheduled to run at specific time daily. This server is also used for keeping our backup and log files. Once the files are in cloud storage, “data prep” flows pull and transform the files. Dataprep is a third party tool that’s integrated with Google Cloud which makes the ETL (extract, transform & load) process easy and interactive. Below is a sample screenshot of Dataprep.
 


 

For getting real-time live streaming Google Analytics data to BigQuery, we deployed an application on Google Cloud using a custom script reference.

Data warehouses on the cloud might be an excellent business solution to be considered if your institute has data stored in various source systems and have difficulty sharing data and reporting from these multiple data sources. It is simple and affordable to implement as a solution.
 

Samtha Reddy
Samtha Reddy
June 6, 2019