Uses of ware house and its scale in era of today | Importance of ware house
The most important property of data
warehouse so if you guys had any problem the previous slide then I'm pretty
sure this is something you have really understood after this slide because this
is probably the epitome of for data warehouse right these three properties if
you understand these four properties then you are pretty much ready to
understand the next part the next or you are ready to go to the next step and
data warehousing right so finally you should have understood how important all
this is why I know how important business intelligence is and what can improve
the data warehouse place you can just think of how among us our deal a data
warehouse is correct yeah so a couple of people are satisfied with that so in
order to spoken about the properties
let's go to the next slide and okay now we have to talk about key terminology
okay so right now we've
understood data warehousing from a higher level okay now
let's dig deep let's go to mode
basics okay let's understand the key terminology that are all related
and that
are involved in a data warehousing so first of all we have LTP and OLAP okay now
there are four things I will
talk about the differences between OLTP and Ola okay then I'll talk about
UDL I'll
talk about data Mart and then
finally about metadata okay so let me go to the first topic that is Ltd
versus OLAP
okay so
in this part which is l DB over DT stands for online transaction processing okay now
this is something that is the
representation of data fir database if you're running any kind of queries
on your
database then that's called online transaction processing okay and then o is last
stands for online analytical
processing and this is the property of a data warehouse so any
kind of query or any kind of
analysis that you run on your data warehouse that's called as an
OLAP activity
correct so let's go to the
differences between the two so first of all any data that is stored in a relational
database right in an which
involves OLTP that contains the current data as well as past data okay current
data as well as path data but
with respect to data warehouse and while performing an OLAP you will be
dealing with
only historical data here okay it contains only historical data and the data that
will be stored in your
database okay when you use the OLTP then those queries will be useful and
running your
business okay when you have to run your business like if you want to store
the data of the number of sales
that has happened today like every time a field happens then your records in
your database as we updated right so that's what we say so when you update your
record with the latest details of your customer then that is what is the meaning
of useful and running the business okay but in order for something that is
usually analyzing the business so here the kind of activities that you would do
is that of finding out details like at what time how many customers bought the
products or at what time which customer but which all products all these kind
of questions will be answered with as both hola okay and then the whole OLTP
model of accessing data of accessing or querying data on a database is based on
the entity relationship model okay but whereas with the data warehouse it's
based on the star schema or the snowflake schema and the fact constellation
schema okay so it's called also called as the Gatsby schema so
all these three
things will come into picture okay and then your relational database it
provides a primitive and a highly detailed data so since you'll have one
database if you run one kind of a query like a select star from this particular
table then it would give you all the details that are stored right so you
cannot filter too many details with respect to the data that is stored in a
database okay of course you can but the level of you know the filtering and
analysis that you can do is not that much so the reader that you will get back
from your query write the result that you will get back from a query will be
highly detailed okay and it will not be exactly what you want so it will not be
that accurate but whereas with pull up when you do an OLAP on a data warehouse
it provides summaries and the consolidated data Never land point to you exactly
what you want to look for correct so it will it's a very processed data and it
points to one particular aspect which matters the most so that's what this is
and then the oil TB you use this for writing data into the database okay so
ever like I said the same sales example whenever a new sales happens your
database has to be updated right with the new records of the product sold of
the customer who bought the product and all these things so you basically use
it for writing data into the database
but your data barrows is primarily used for reading data from the data
warehouse so right into the data warehouse or something that is done so that you can do the reading from the data
warehouse ok the primary concept here is to read the data from the data warehouse
and to do the analysis and all the visualization actually but with the database
it's more of writing the data into the database alright and the size so
speaking of the size a databases size would raise anywhere between hundred me to
1gb ok and this is also a very big number 100 MB is a very less number I would
say and one tip is also very less so it would typically be much more than this
but come back to this range if you look at a data warehouse a data warehouse is
ranges from 100 GB to 1 TB correct so your data wells will have all the historic
Durant and it will have all the relationships between the different data right
such that so you can do your analysis straight away so since it makes all the
data more efficient and stuff the data here the size ranges from 100 GB to
almost 1 P so that is what our data warehouses and that's the part of data
warehouse all right and I can actually show you the difference between the two
in today's demo session ok later during this session I'll show you that the
size of the source file that I have added with the different sizes of the two
files that are they using as the source and then after the UH processed the
data are all sorted in my data warehouse I'll show you what is side of that
data so there will be a big difference between the two okay and the red arrows
will be modes in the database
I will show you that aspect later
all right so and that's about this point and then yeah of course this will be
fast ok database is fast and it provides high performance alright so your data
where owes of course it's not as fast as your database but it's how you sexy it's
highly flexible because it gives you different views so you have something
called as the OLAP cube right so using the OLAP cube you can get the you can
look at insights from different angles different perspectives and different
views of data we get so that is the big advantage sure okay and the number of
cycles that is accessed it is intense but whereas with the data warehouse the
number of records access is in millions all right an example of this can be all
the bank transactions made by a particular customer so if there is one customer
and whatever transactions he's made you will get all those or you know all
those videos right
so supposing take the example of
any email statement that you request for any email statement or a bank account statement
all these things so they are an example of OLTP so whatever radials are present
in the database those will be given to you but whereas the bank transactions
made by your customer at a particular time okay this is a most filtered query
and the answer also will be a very accurate and point Q to exactly that
particular question so that's what an OLAP is so it's normal to be one overview
here is one exactly point you to what you want it's not very much in detail but
it's more accurate correct so that's what an app is and that's
the difference between LTP and OLAP okay so
there are two strategies that are used for you know accessing data well DB is
used for accessing data in a database and we'll appreduce for accessing data in
a data warehouse alright so I hope I mean this is cured everyone all right guys
okay if you have any doubts then that will be cleared by this slide okay because
the examples are there for both these strategies so if you want more examples
of an OLTP then one would be that of a supermarket server which records every
single product purchase at that market okay so every single perk in their
history or probably in the last
one month all these things can be
accessed using your OLTP okay from your database so you don't have options to
do much of filtering here and then another example that of a bank server which records
every time a transaction is made for a particular account every time or transaction
is made in a particular account the data will be updated in that table and you
can create that kind of data and you'll get that result okay
another example is our railway reservation
server which records the transaction of a so whatever equities that passenger
does so all these things will be recorded by the reservation server right so
this is one example but when you look at the OLAP there will be much more
detail the query juror will be much more detail and the answers will be much
more accurate and very crisp okay so
an example is a data from bank manager wants
to know how many customers are utilizing the ATM of his branch okay because
maybe based on that he may take a call whether to continue that idiom or
relocate that idiom to a different place right so this guy this bank manager
would want to one of us for understand if there's any use and having the ATM in
that place are people using it correct if people are using it then how many
people are using it at what time are they using it or would it be better to
have the ATM in a different location where it's much more easier for people to
access it so all these kind of questions will be answered only if you have the
oil up in place okay only when you are a data warehouse in this not with the
database
so another example is that of an
insurance company that wants to know the number of policies each agent has sold
okay this will happen a better performance management of agents so you'll have multiple
agents in your company and you'd want to know their performance right you know
you would want to know who is the best performing agent you would want to know
how they're performing why is this person performing better you can you know
optimize each of their performances so all these things can be done with the
help of data barrows and they can be done with help of Allah right the OLAP
strategy the OLAP activity that is done on a data warehouse so that is a
difference between the two right so I hope you're getting the difference as
your guys right OAP and OLAP okay since it's all clear let me go to the next
slide now
the second important terminology
that we have with data warehouse is that of ETL extract transform and load so
by going by the definition usual is the process of extracting the data from
various sources transforming this data to meet your requirements and then
loading the data into a target data warehouse okay you are exciting the data
from here right you're extracting it and then you're transforming it into the
way you want in a more readable fashion in a more relatable fashion then once
that is done you load the data into a data warehouse and the whole process of
for getting rid of from your data source to your data warehouse this done by
the EDL the activities of extraction extraction transformation and loading so
we have got popular tools for for this very process so you have tools like
talent informatica you have open all these things and informatics and talent
are probably the most popular tools for this process for extraction transformation
and loading data into a data warehouse right so this is something that usually
understood by now and it out guys because I don't want to waste much time
I want to go to the next slide
and teach the next concept ok great so the next one is data Mart okay now if
you've understood so far to LA ETL then half your job is done because data Mart
is something that's close to a data
warehouse and you don't have that much of a difference when it compared it to
our data warehouse but the basic difference between the two is the data Mart is
just the same data warehouse itself but a smaller version ok so let's look at
the line the definition here the data Mart is a smaller version of the data warehouse
which deals with a single subject ok the data Mart's are focused on one area
hence they draw the data from limited number of sources and the time taken to
build the data match is very less compared to the time taken to build a data
warehouse now to give you
an example or an explanation of
this in your data warehouse you will have all your details right all your raw
periods that you have so this itself is more refined ok but your data mods are smaller
versions of that data warehouse which is used to satisfy only a certain users
supposing you have one particular user base ok that is your sales team or your
sales manager who wants to use only or receives data then that can be done with
the help of a data Mart and similarly you have a marketing team who wants to
access all the marketing data then that can be done by using a data Mart ok and
then again you have other operations team that wants to do wants to access the
operations data then you can give them only access to that data by using data
Mart separately ok now the need for data Mart is first of all your data warehouse
will show all the data ok your sales data marketing data operations data so all
the data coming from different data sources so you'll have multiple data
sources right all that will be stored in one single place and from here this
will again like separate data sources right so for the sales team this will act
as a data source for a marketing team this lap as a data source and then for
the operation seen this Mart will act as a data source now the now as in the
question as to why we need them it's to you know probably more security to
enable more security and integrity because since all your data will be there in
your data warehouse and if you let all you’re
your entire company access the data warehouse
they will get access to all the
details and all the all the work that is done by other teams okay so there
might be multiple teams and every multiple
strategies which you might not want to reveal other rates so at that
enough time you can divide your data warehouse such that only this particular
users okay your CL user or a particular user base gets access to only certain
data from your data warehouse so at that kind of a time a data Mart is useful
or you can have a second data Mart for only another set of user base for
example your marketing users so they will get access only the marketing data
from your data warehouse right and then your operations data similarly which
can be accessed only by your operations force so that's the different
advantages that you have since the data is all divided it's all stored
separately different people have access to different parts of your data warehouse sothis probably brings more advantages right
so that's the data Mart and that's why I told you that you know if you understood
so far till then half your job is done but data Mart or something like that
extends the functionality of your data warehouse right so that is the thing and
stinking off the differences that is there in this table the data warehouse
will store the enterprise-wide data right the enterprise of the enterprise wide
data whereas data much will store it departments wise data there'll be multiple
departments in the whole company in the
enterprise and they will sort
department-wide data and then the data warehouse
it will have multiple subject
areas okay but a data Mart will have a single subject area there'll be multiple
data sources in case of a data warehouse okay but in case of data Mart's
they'll be limited data source in fact they can be just one data where all right
this will act as a source pure data
match but however we have gossiped limited your because not always do you need
a data warehouse there are also instances where your data source itself as an
input data model okay now that is something that you will answer next slide
okay so just don't get transfused when
we see that limited resources it does not mean this is the only source they can
also come from a data a property a source like a flat file or from a data base
and all these things all right and then
a data warehouse it occupies largely
this of course because there is a lot of data on to price where data will be
stored your multiple subject areas will be dead because of that there's larger
data here at state and that's why there is greater memory that's occupied okay
but in case of your data mod it occupies limited memory because it's very crisp
and limited to only a particular department okay and then the other thing is
data where else is long it takes longer time to implement
but a data Mart is a takes very
short time to implement it is once you have all your data warehouse or stuff in
place you can easily divide them by creating different data mass okay so be a
tough part is your data well so once you've got a data warehouse ordered you
can easily from your data mass in fact there is even the other way there are
also practices where you first bring your data mass and once you've done that
after that you create one single repository and that's when you create a data
warehouse so there's also two approaches here one is the top/bottom approach
and the other one is bottom-up approach so those are the two things and I'll go
to details about
these two approaches in my next
session ok of course I can't do it today because we have very limited time all
right so moving on to the next slide ok now speaking of the different types of
data mass ok this is what I was talking about in the previous slide so you have
three different types one is a dependent data Mart the other one is independent
data Mart the third one is hybrid data mod you are dependent data matters are
the data's first extracted from the OLTP systems and then populated in the
central data barrows and then from this data warehouse the Ritter travel to the
data
Mart so look at this example ok
this is the standard practice or the year the default approach where you have
an OLTP source then you get the data into a data warehouse and then from the
data warehouse you form a data model ok we'll have multiple data mass where
each different Mart will have particular data from the entire data warehouse ok
this is the first regular approach and then you have the independent data Mart
which is a slight variation compared to dependent so here the rears directly received
from the system okay you don't have a data barrels in place over here that is
what this line means and this is suitable for small organizations or smaller
groups within an organization so basically an organization which is very small
it might not need to go to the trouble of creating a data warehousing stuff so
you can just have an OLTP source and from
there they can just get the data in onto
a data Mart and they can use for various purposes okay that's what we
have an independent data Mart for so that's the difference it just does not
involve a major data warehouse so directly the data goes to a data Mart in fact
you'll have multiple data Mart over your you'll have great or not one data amount to one stuff okay which they coming directly from the OLTP source and
then you have a third type which is the hybrid
data Mart so by definition you might know what
this is right by your the name of by the name itself is pretty obvious it's combination
of these two the data here is fed from both the OLTP systems as well as the
data warehouse okay so look at this example for that tour instance you three
are here it's coming from the ODB source as well as from a data warehouse so
this is what the hybrid data models and
depending on your company depending on the size of your company the
requirements of your company or your organization you can choose one of these
you can model your entire database and data barrows in any one of these models
either the dependent or the independent data mass or hybrid Radames alright so
that's about the different types of data model so moving on to the next slide
we have something called as metadata now people here from programming background
or from the technology background you might all be aware of what America does
metadata basically is defined as data about data okay it contains data about
where your actual data stored supposing you have your raw data right where is
that data stored
what is the size of that data so
these are the answers to these kind of questions is what will be Plus on your metadata
your metadata will contain the location of your actual data it is not n the
size of your actual data it will contain details like which was the source it
came from and when was it created all these details will be stored in your metadata right so that's what made
our IRA's so that's how different our metadata is from regular data and a metadata
is specifically in a data warehouse it defines the Soraa there is a flat file a
tional database and other objects so the reason that we give so much importance
to metadata in a data warehouse is because take
the example of any company that's having a
24/7 business okay they have a rolling system that works throughout the clock
the 24/7 they are sales coming in data will be going into your database okay
now in this case you cannot always you know keep adding data into your data warehouse
because you know that data warehouse is not real-time correct so you have to
manually update your data into your data warehouse probably at every day at a
particular time maybe at 6 o'clock every day in the morning or maybe once in
the night at 10 o'clock or at basically at regular intervals you've got to so
data into your data warehouse and this whole strategy becomes difficult because
you have to do this process every day so every day when a guy comes in in the
morning he has to look at the new data that has come into your data database
correct and then from that data base which is data source he has to add that
adagio data warehouse now this process becomes difficult that is where a
metadata comes in handy with metadata you define the address where your data
sources okay you define the address of your raw of that file from my data's
coming in or your relational data from measure data is coming in and then you
can also impact us or the metadata of your raw data warehouse where you want
the data to go so it basically Ellen you know see a lot of your time okay
and this is the most common and
the most you know unspoken fact about data warehouse so any professional that's
dealing with data barrows you would always be using metadata because it saves a
lot of time because every time you cannot be importing data from a database you
always have to get it from your metadata by defining rules and defining your
source and your targets correct so mirror it has gives lot of your time and
this is something I will show you a demonstration on ok and when I show demonstration
you will understand how easy it is so we just read the rule says that you have
your source over here and then you have your destination over your target over
here and once you've written this thing you don't have to go back here so every
day it might pick up the data from here and it might move your data warehouse
and all these things it will update all the new details which are present in
your database and it will add it into your data warehouse so that's what a
metadata does
it's a very big advantage and
it's one of the best Sergey's are being okay and then the final point here is metadata
is used to define which table is source and target and which concept is used to
build a business logic called transformation to the acts alone but yeah this is
what I told you right so it's used to define which is your source and target
and how to build your business logic called transmission so transmissions
basically your rob act of converting your source data into the form that you
want to and what is the logic that you use correct so all the different the
filter criteria all the transformational criteria's all these things can be also done using the metadata
correct so even the process of forgetting data from your source your destination
it can involve extra additional steps which can save your time and default
process basically so every time you have data in a particular format you might
want to store
it in a different format and for
all that purpose you can use a metadata right so since things will already be
defined you're the work of metadata is just to get the data from the source
define and do the set of activities that is required and which is already
defined it will perform all those activities and it will sort in the case where
you want to do so that is the role of a metadata correct so many data is very important
very highly used and it's actually the most important or let's say the epitome
of data barrows okay this is probably the best thing that can happen to data
warehouse right so that is the thing about the major a degrees now going to the
next slide we have the architecture so now that you know everything okay
these four raw terminologies are
in our removing enough for you to understand the architecture okay so let's
understand what the architecture is so this is the entire architecture diagram
okay so this is what you know data comes in from various sources it can come
from either a database or in the from flat file and then that data an action of
ETL will be performed on that data and it will go to the staging area okay this
called the sitting area and this is the staging database and the rails are
stored over here it is temporary data before data completely moves to the data
warehouse it will be present in this area okay and that is done by using the
act of ETL and also between moving to the data warehouse the process of idle
continues so ETL process starts over here and it ends over here okay and
between the conversion it is a temporarily stored in a staging area and this is
most often present inside the ETL Coolidge okay
like your talent order semantics
and all these things and then this data will be stored in your data warehouse
so whatever is the extracted transformed unloaded it will be loaded into your data
warehouse and in your data barrows you'll have metadata okay and of course you'll
have your raw data and then you will have your aggregate data okay and this is
the reason why a data warehouse is generally you know it's a larger in size
because it has not just raw data or data base which from where the data is coming
in it will only have your raw data okay but your data warehouse will have
additional stuff here it will of course your metadata and your aggregate data
and together all these three things together is what helping you being you you
doing your analysis sooner okay this is what powers you're ready to do OLAP online
analytical processing okay so that's what a data warehouse is and that's what
it stores and there's this data warehouse you can either your entire company
can use the data warehouse or if you want more for security based access then
you can divide it into a different data mass where your sales team and your
different teams like here we have a purchase table and then you have a stock
table so
these are like three different maths data mass for three tables like a sales
purchasing stock so you'll have different tables here and different teams can
access different set of tables okay your purchase can be something that's used
by your operations people right and then this may be something that's used by your
raw sales people and your stock may be again used by your operations or your sales
group so that's what each of these your you know defines you have your user group
one user group two and user group three and each group will be getting access
to different parts of your raw data warehouse because your data valves will be
divided into different data Mart's and your data your different groups we could
access only that data which they want to or which they can get access to right so this way no group can get
access to every data that is personal data warehouse and there's advanced
there's a little more data security in this is okay same thing over here so
this group get action into this data and this data and this table or this mod
right and then this user group gets accessed only the Riddler is Belgian they
smart and then this Mart so that's what the entire architecture looks like and resolve
the data flow is right so if you guys have understood
this much then you're
ready to get a demonstration correct guys and this is also my last topic in
this presentation so right now I can go to my demonstration and what I'll be
doing in my demonstration is I'll be using talent okay so I'll be x importing
data from my database I'll be using be getting it from my Oracle database and
I'll be showing that into a data warehouse which can be ready for any kind of
analysis or visualization on any other visualization tool okay so this act over
here which I'm going to show you this is what powers your business intelligence
right so are you guys ready for the demonstration any doubts your guys okay
great Ronnie says he's already he's all pumped up already Ronnie that's very
good so Jacobs also ready and so is Rajesh are you Rajesh can I get a yes from
you yes very good olive oil Rajesh
so let's go to the next part of
my session and there's going to be demonstration where I will populate or detail
barrows okay I'll be using talent for the data warehouse activities and let's
see how we can import data from waste data sources and create or data barrows
so for our hands-on session I'll be importing data into my talent bi okay like
I said earlier we love using my talent bi for night data warehousing and ETL
processes and I will create a data warehouse out of using talent bi okay now
the data set that I'll be using is that of per 10,000 row table and 3,000 row
table okay so the there be one table having 10,000 customer details okay and then
there'll be another table having 50,000 rows of transactions which each of
those customers make okay now based on
this data we have to find those customers who have the lowest number of
purchases okay so right now my dataset is
present in my Oracle database okay and so I have two tables and both my tables
customers table and my transactions table is present in my Oracle database
I
also have them in my here excel so first let me show you how they are present
in my excel file okay okay so this is my freshmen table as you can see we have
all these days right we have all these fields here so we have customer ID we
have customer name we have the address of that customer city the country is
from the contact number and his email address right and if I do this then you
can see that there are almost 10,000 cutoff fields here so 10,000 rows are
there here so this means that you know I have it's a pretty big data set and I
can use this in competition with my other table okay so here we have customer
ID as the primary key okay and if I go back to my other data set that's my
transactions table you can see that I have furthermore freeze your Abdul for invoice
number of a stock code I have description app quantity I have invoice date unit
price customer ID and Product ID okay so there's a customer ID over here also
so the customer ID is the primary key for there and this is the foreign key
over here so basically the customer ID is the same and I will have to do a
lookup to that table with the help of my customer ID okay
since these are two different
tables I can create my data warehouse with the help of this particular primary
and foreign key concept I can create a link and I can create I can probably
just link these two tables with the primary key for a key concept okay so any
doubt guys I'm sure that you all know what a primary key and a foreign key is
and I fed all anybody has a problem please let me know okay so Rodney says I
don't know what's a foreign key okay so see Rodney the thing is we have
something called as the primary key and a foreign key okay and we use these two
columns when we want or when you want to use a combination of two different
tables right so if you would consider the example of this one then you can see
that there are a number of columns here and not everything is matching with the
radius in that put in the other table okay so only customer ID is the only
common field okay so there's a customer ID here and even in this table we have
a customer ID now the thing is the customer ID is a primary key because it's
it's constant and it's unique for each and every single record over here so
each and every customer here will have a
separate customer ID okay so no two customers will have the same customer ID
okay and over here the transactions here which has the inverse numbers the description
of the product data spot is the stock code the quantity all these things are
sorted by the customer writing okay so if you find a particular customer ID
more than once then well here you can find the person already more than once
that's because a customer would have made more than one transactions.