2. Introduction
• Previous presentation
covered what data is
• In this presentation we
cover where data comes
from and factors we need
to take into account when
gathering data for
processing
3. Data Sources
Data can be collected either:
• DIRECTLY
– Gathered from an original source
or
• INDIRECTLY
– Gathered from another source or as a by-product of
another operation
• In the world of business these would be described as
primary and secondary sources of data
4. Sources of Information
• Primary data is ...
data that you (or your organisation) gather and interpret
yourself
• Secondary data is ...
... where another organisation uses the data you have
collected and interprets it for other purposes
5. Direct (Original) Data Sources
• Sale of an item in a
supermarket recorded at
EFTPOS terminal
• Data from sensors (e.g. a
weather station)
• Data collected in a survey (e.g.
a questionnaire or an
interview)
6. Indirect Data Sources
• Data collected for one purpose and used for another
– A credit card company collects data about your spending in
order to bill you each month. However, a secondary use of
this data is to build up a “profile” of your spending habits.
This data can then be used to send you direct marketing
about goods and services that may appeal to you.
[Diagram: Credit Card Transaction → Direct Use of Data: Customer Billing; Indirect Use of Data: Direct Marketing]
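The credit card example can be sketched in Python as below. The transaction records, category names and function names are all hypothetical, invented for illustration; the sketch only shows that the same records can serve a direct use (billing) and an indirect use (a spending profile for marketing).

```python
from collections import defaultdict

# Hypothetical transactions: (month, category, amount)
transactions = [
    ("2006-02", "groceries", 54.20),
    ("2006-02", "travel", 120.00),
    ("2006-02", "groceries", 31.80),
    ("2006-03", "music", 15.00),
]

def monthly_bill(records, month):
    """Direct use: total the amount owed for one month."""
    return round(sum(amount for m, _, amount in records if m == month), 2)

def spending_profile(records):
    """Indirect use: total spending per category across all months."""
    profile = defaultdict(float)
    for _, category, amount in records:
        profile[category] += amount
    return {category: round(total, 2) for category, total in profile.items()}

print(monthly_bill(transactions, "2006-02"))
print(spending_profile(transactions))
```

Neither function changes the stored records; the difference between direct and indirect use lies only in what the data is used for.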
7. Indirect Data Sources
• Purchased data/data passed on
– There are a number of ways data can
be acquired from 3rd parties and then
used for a different purpose
– A good example is the electoral roll.
Its main use is to gather data about
who is eligible to vote.
However, marketing companies make
extensive use of the roll to target
customers.
8. Coding Data
• Before being stored in a computer, information can be coded as data e.g.
– M or F
– Mo, Tu, We, Th, Fr, Sa, Su
– I, II, IIIM, IIIN, IV, V
– S, M, L, XL, XXL
• In the picture shown we can see the date code for the tyre
– This represents the eighth week of 2006
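As a sketch of how a code like this is turned back into information, the Python below decodes a tyre date code. It assumes the common four-digit form WWYY (two digits for the week, two for the year) and a year in the 2000s. The function name and the sample code "0806" are illustrative, since the slide does not give the digits shown in the picture.

```python
def decode_tyre_date(code):
    """Split a four-digit WWYY tyre date code into (week, year).

    Assumes a four-digit code and a year in the range 2000-2099.
    """
    if len(code) != 4 or not code.isdigit():
        raise ValueError("expected four digits, e.g. '0806'")
    week = int(code[:2])
    year = 2000 + int(code[2:])
    if not 1 <= week <= 53:
        raise ValueError("week must be between 1 and 53")
    return week, year

week, year = decode_tyre_date("0806")
print(f"week {week} of {year}")  # week 8 of 2006
```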
9. Benefits of Coding
• Less storage space is required
– M and F require less storage space than male and female
• Faster data input
– M and F are quicker to type than male and female
• Validation is easier
– With a limited number of codes it is easier to match them
against rules to check they are entered correctly
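Two of these benefits can be illustrated with a short Python sketch. The code table reuses the size codes from the earlier slide; the function name is hypothetical, and the byte counts assume the text is stored as UTF-8.

```python
# Code table for clothing sizes, using the codes S, M, L, XL, XXL
SIZE_CODES = {"S", "M", "L", "XL", "XXL"}

def is_valid_size(code):
    """Validation rule: the entry must be one of the known codes."""
    return code in SIZE_CODES

# Storage: a one-character code against the full word, as UTF-8 bytes
print(len("F".encode("utf-8")), len("female".encode("utf-8")))  # 1 6

print(is_valid_size("XL"))  # True
print(is_valid_size("XS"))  # False: not in the code table
```

Because the set of allowed values is small and fixed, the validation rule is a single membership check.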
10. Drawbacks of Coding
• Precision of data can be lost (coarsened)
– In the example all shades of blue are coded as “blue”
[Diagram: “Data in” coded to “Stored data”: Pink, Blue, Black, Blue]
• The user needs to know the codes used
– How many of these top level domains do you know?
– au, ch, de, ie, pk, fr, il, lk, es
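The coarsening drawback can be sketched in Python as follows. The shade names and the lookup table are hypothetical; only the stored codes (Pink, Blue, Black, Blue) come from the slide.

```python
# Hypothetical lookup: several shades are all stored under one code
SHADE_TO_CODE = {
    "navy": "Blue",
    "sky blue": "Blue",
    "turquoise": "Blue",
    "pink": "Pink",
    "black": "Black",
}

data_in = ["pink", "navy", "black", "sky blue"]
stored = [SHADE_TO_CODE[shade] for shade in data_in]
print(stored)  # ['Pink', 'Blue', 'Black', 'Blue']

# The original shade cannot be recovered from the stored code alone:
# "Blue" could have been any of these
shades_for_blue = sorted(s for s, c in SHADE_TO_CODE.items() if c == "Blue")
print(shades_for_blue)  # ['navy', 'sky blue', 'turquoise']
```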
11. Coding Value Judgements
• Coding value judgements can be a particular problem as
they are subject to personal opinion
• What do you think of this presentation?
– Good? Average? Poor?
– One person’s good may be another person’s poor!!!
• Value judgements are very difficult to encode without
some coarsening (loss of detail)
• How would you improve the analysis? What are the
time/cost implications?
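One way to see the coarsening is to code a numeric score into the three categories. This is a sketch only: the 1 to 10 scale, the thresholds and the function name are assumptions made for illustration, not part of the presentation.

```python
def code_rating(score):
    """Coarsen a 1-10 score into one of three codes (thresholds are assumed)."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score >= 7:
        return "Good"
    if score >= 4:
        return "Average"
    return "Poor"

scores = [7, 10, 6, 4, 3]
print([code_rating(s) for s in scores])
# ['Good', 'Good', 'Average', 'Average', 'Poor']
```

Scores of 7 and 10 are both stored as "Good", so the difference between them is lost. Storing the finer score keeps more detail but takes longer to collect and analyse.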
12. Quality of the Data Source
• GIGO (Garbage In, Garbage Out)
• If data input is poor the resulting information output will be poor, e.g. corrupt or inaccurate
• Can you think of any “real life” examples?
[Diagram: Garbage In → Garbage Out]
13. Quality of the Data Source
Examples of GIGO can include:
• Unreliable questionnaires/surveys
– e.g. inappropriate samples, badly
worded questions etc.
• Incorrectly calibrated instruments
– e.g. an incorrectly calibrated balance
will give incorrect measures of mass
• Human error
– e.g. transcription errors when entering
data
• Incomplete data sets
– e.g. failing to account for “shrinkage”
when measuring supermarket stock
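The calibration example can be sketched in Python to show how an input error carries through to the output. The masses and the 5 g offset are hypothetical values chosen for illustration.

```python
def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)

true_masses_g = [100.0, 250.0, 400.0]  # hypothetical true masses in grams
OFFSET_G = 5.0                         # hypothetical calibration error

# Garbage in: every reading from the balance is 5 g too high
readings_g = [m + OFFSET_G for m in true_masses_g]

# Garbage out: the processing is correct, but the result is still wrong
print(mean(true_masses_g))  # 250.0
print(mean(readings_g))     # 255.0
```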
14. Summary
• Data can arise from direct and indirect sources
• Information can be coded as data
• This has a number of benefits but can lead to
coarsening
• The source/accuracy of data has a major impact
on the quality of information produced i.e. GIGO