Matillion Series — Part 1
Matillion is an ETL tool with an ELT architecture. It supports many data warehouses; here we are using Snowflake as the data warehouse.
ELT — Extract, Load and Transform
Before building an ELT process, you should know that in Matillion, ELT is divided into two main parts:
- Orchestration Job
- Transformation Job
An Orchestration Job is an area in which you create the workflow for a job from source to target. It is an arrangement of different components, with a specific starting point, ordered according to the job's needs; the whole workflow can be designed as an Orchestration Job. It is mainly used for creating and dropping tables, and for extracting data from various sources.
A Transformation Job is an area in which you transform your data as per your requirements; unlike an Orchestration Job, it has no specific starting point. In a Transformation Job, we change data that is already present in tables.
In this project, we extract data from a Google Sheet, then append the new data to a table in Snowflake while making sure the old data is retained.
Start by making a Google Sheet as shown below. I have named the spreadsheet ‘spreadsheet’ and the tab ‘Sheet1’.
Then, in Matillion, create a folder: right-click on the default version and select the first option, Create Folder, as shown in the image. I have named the folder ‘Matillion New Job’ here.
Next, right-click on the folder name and choose Add Orchestration Job. Matillion will ask for the name of the Orchestration Job; after you enter it, a ‘Job Successfully Created’ pop-up appears asking ‘Switch to new job now?’ Click ‘Yes’.
Go to the Components section, just below the Development section, and search for the ‘Google Sheets Query’ component. Drag it onto the job canvas near the Start component; it will automatically connect to the Start component.
As soon as you click on the Google Sheets Query component, the ‘Properties’ tab below shows several options that we have to fill in by clicking the icon shown in Fig-5.
- Name — ‘Component Name’
- Basic/Advanced Mode — We will keep Basic in this project, whereas in Advanced mode you can write your own SQL for extraction in the component.
- Authentication — You can set up Manage OAuth by following the steps mentioned in this link.
- Spreadsheet name — ‘Add the name of the Spreadsheet’
- Contains Header Row — Set to Yes if the first row of the Google Sheet holds the column headers.
- Connection Options — Not necessary; the default setting is fine.
- Data Source — Select the option named ‘Spreadsheet Name’_‘Sheet Name’. For example, here it is spreadsheet_Sheet1.
- Data Selection — Clicking Data Selection opens a panel where you can select the columns you want in the output. Click the icon indicated by the blue arrow if you want all the columns.
- Data Source Filter — Filtering options that let you ingest only the data meeting certain conditions.
- Combine Filters — Keep it as ‘And’.
- Limit — No need to change.
- Type — Standard.
- Primary Keys — No need to change.
- Warehouse, Database, Schema, Target Table — Select these as per your requirement.
- S3 Staging Area — Whichever bucket you want to choose.
- Encryption — None.
- Load Options — Keep the defaults; the table will be recreated every time the job runs.
- Auto Debug — Off.
Now, create a Transformation Job the same way we created the Orchestration Job: right-click on the folder name and select ‘Add Transformation Job’.
Hint: Name the jobs ‘01 Orchestration Job’, ‘02 Transformation Job’, and so on, to keep them in order.
As shown in Fig-10, search the Components section for Table Input and drag it onto the transformation job. Keep the Database, Schema and Target Table the same as in the ‘Google Sheets Query’ component (Fig-9), and in Column Names select the same columns as shown in Fig-7.
Drag a Table Update component onto the canvas and join it to the Table Input component.
Then, in Snowflake, create a table by running the query given below:
--Extracted table from Google Sheets Query component
--You can also use the Rewrite Table component of Matillion
--Same as shown below in Fig-12
CREATE OR REPLACE TABLE "DATABASE_NAME"."SCHEMA_NAME"."TABLE_NAME" AS
SELECT * FROM "DATABASE_NAME"."SCHEMA_NAME"."GOOGLE_SHEET_TABLE_NAME"
Run the above query in Snowflake to create the target (final) table.
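To see what this step does outside Snowflake, here is a minimal sketch using Python's sqlite3 as a stand-in for the warehouse. The table and column names are hypothetical, not the ones from your project:

```python
import sqlite3

# In-memory database standing in for the Snowflake warehouse.
con = sqlite3.connect(":memory:")

# "google_sheet_table" stands in for the table loaded by the
# Google Sheets Query component; the columns are hypothetical.
con.execute("CREATE TABLE google_sheet_table (name TEXT, email TEXT)")
con.executemany(
    "INSERT INTO google_sheet_table VALUES (?, ?)",
    [("Jack", "jack@example.com"), ("Anna", "anna@example.com")],
)

# Equivalent of CREATE OR REPLACE TABLE ... AS SELECT * FROM ...
con.execute("DROP TABLE IF EXISTS target_table")
con.execute("CREATE TABLE target_table AS SELECT * FROM google_sheet_table")

print(con.execute("SELECT COUNT(*) FROM target_table").fetchone()[0])  # → 2
```

The target starts out as an exact copy of the extracted table, which is what the CREATE ... AS SELECT above produces in Snowflake.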
In the Table Update component, select the Database, Schema and Target Table according to the table created in Fig-12. Then change the Join Expression option as shown in Fig-13.
Then change the ‘When Matched’ options by clicking the “+” and adding the case and operation columns as shown in Fig-14. When it runs, this transformation job will delete from the target (our main table) all rows that have a matching email.
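Conceptually, this ‘When Matched → Delete’ step removes target rows whose email also appears in the source. A minimal sqlite3 sketch of that behavior, with hypothetical table names standing in for the Snowflake tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target_table (name TEXT, email TEXT)")
con.execute("CREATE TABLE source_table (name TEXT, email TEXT)")
con.executemany("INSERT INTO target_table VALUES (?, ?)",
                [("Jack", "jack@example.com"), ("Anna", "anna@example.com")])
# The source holds a fresh row for Anna (same email, updated data).
con.execute("INSERT INTO source_table VALUES ('Anna B.', 'anna@example.com')")

# "When Matched -> Delete": remove target rows whose email also
# appears in the source, mirroring the Table Update join on email.
con.execute("""
    DELETE FROM target_table
    WHERE email IN (SELECT email FROM source_table)
""")

print(con.execute("SELECT name FROM target_table").fetchall())  # → [('Jack',)]
```

Only the non-matching row (Jack) survives; the matched row is gone and will be replaced by the append job.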
Then create another Transformation Job, as shown in Fig-8.
Then you can copy-paste the Table Input component from the previous Transformation Job, or follow the steps in Fig-10.
Then drag a Table Output component and attach it to the Table Input component, following the steps given in Fig-15.
This final Transformation Job will append all the data present in the source table to the target table.
Now arrange the components as shown in Fig-16: first the transformation job with the Table Update component, which matches on email and deletes the matching rows; then attach the other transformation job, which appends from that same table. As a result, whenever data is erased from the Google Sheet and a new applicant’s data is added, the job retains old rows with different email addresses and updates the rows having the same email address.
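Putting the two transformation jobs together, the overall effect is a delete-then-append upsert keyed on email. Here is a minimal end-to-end sketch with sqlite3 standing in for Snowflake (all names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target_table (name TEXT, email TEXT)")
con.execute("CREATE TABLE source_table (name TEXT, email TEXT)")

# The target already holds historical applicants.
con.executemany("INSERT INTO target_table VALUES (?, ?)",
                [("Jack", "jack@example.com"), ("Anna", "anna@example.com")])
# The sheet now holds an update for Anna and a brand-new applicant.
con.executemany("INSERT INTO source_table VALUES (?, ?)",
                [("Anna B.", "anna@example.com"), ("Tom", "tom@example.com")])

# Job 1 (Table Update): delete target rows with a matching email.
con.execute("DELETE FROM target_table "
            "WHERE email IN (SELECT email FROM source_table)")
# Job 2 (Table Output): append everything from the source.
con.execute("INSERT INTO target_table SELECT * FROM source_table")

rows = sorted(con.execute("SELECT name, email FROM target_table").fetchall())
print(rows)
# → [('Anna B.', 'anna@example.com'), ('Jack', 'jack@example.com'),
#    ('Tom', 'tom@example.com')]
```

Jack is retained, Anna’s row is replaced by the fresh data, and Tom is appended as a new row, which matches what the two jobs do in Matillion.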
Let’s change the spreadsheet and run the job, so as to get better insight.
If you compare Fig-17 with Fig-1, you will see a change in the first row.
Now, right-click on the Orchestration Job and run it as shown in Fig-18.
After the job completes successfully, check in Snowflake. You will see the result shown below.
You can see in Fig-19 that Jack’s data still exists in our database, and in row 2 you can see the newly appended data.
Congrats, you have created a job in Matillion using the Google Sheets Query component.
You can connect with me on LinkedIn with more questions.