The 7 Things I Know About Cyber Security After 25 Years | April 2024
Data Warehousing and Mining Data from Library and University Systems for Assessment of Library Operations
1. Data Warehousing and Mining
Data from Library and University
Systems for Assessment of Library
Operations
ENUG Conference
Cheng Library, William Paterson University,
Wayne, New Jersey,
Thursday, October 21, 2010
Ray Schwartz,
Systems Specialist Librarian
Cheng Library, William Paterson University,
Wayne, New Jersey, USA
schwartzr2 @ wpunj.edu
2. Outline
• What is Data Mining and Data
Warehousing and Why Do We Do It?
• Our Library and University
• Patron Statistical Categories
• Application Server
• Reporting
2
3. Collecting Transactional Data
• ILSs collect transactional data for circulation
and allocation of collection funds.
• ILL and Document Delivery services supply
general transactional data.
• Reports from vendor services
– Bibliographic utilities
– Subscription agents
– Book jobbers
3
4. Collecting Transactional Data
cont.
• Most ILSs have search and web server logs
• Most (if not all) Databases have usage reports
• Link Resolver logs
• Proxy Server logs
• Many other ways of collecting transactional
data.
– Gate counts
– Reference transaction counts
– Reshelving counts
4
5. What would we like to see?
• Breakdowns by department and majors.
• Combined usage by department/majors
of more than one library service.
5
6. What is Data Mining and Data
Warehousing
• Extracting data from legacy systems and other
resources;
• cleaning, scrubbing and preparing data for decision
support;
• maintaining data in appropriate data stores;
• accessing and analysing data using a variety of end
user tools;
• and mining data for significant relationships.
• Chaffey, D., Mayer, R., Johnston, K., & Ellis-Chadwick, F. (2002). Internet Marketing:
Strategy, Implementation and Practice (2nd ed.). Financial Times/ Prentice Hall.
6
7. • The primary purpose of these efforts is
to provide easy access to specifically
prepared data that can be used with
decision support applications such as
management reports, queries, decision
support systems, executive information
systems and data mining.
• Chaffey, D., Mayer, R., Johnston, K., & Ellis-Chadwick, F. (2002). Internet Marketing:
Strategy, Implementation and Practice (2nd ed.). Financial Times/ Prentice Hall.
7
11. Our Systems
• Voyager ILS
• Clio ILL Software
• EZProxy Server
• Banner – University ERP
• University Networked Drive K:
• University Email Server
• University Web Server
11
12. Vendor Services
• Serials Solutions
• A to Z list
• MARC Record Service
• Link Resolver
• OCLC – Bibliographic Utility
• Worldcat Collection Analysis
• Coutts (was Blackwell) – Book Jobber
• Ebsco – Subscription Agent
• Marcive – Authority Control
• Database Vendors 12
18. • Voyager Patron Database allows a maximum
of 10 statistical categories per patron record.
• Decide which statistical categories are needed
for each patron group defined.
• Work with your University Information Systems
Department to extract the relevant data from
the relevant sources.
18
19. Groups and Services
• Major • Circulation
• Status – Books
– Media
– Undergrad or Grad
– Reserve
– Faculty, Adjunct Faculty or
– By Fund Code
Staff
– Location
• Department
• ILL / Document Delivery
• College • Databases
• Degree • Library Web Pages
• No. of Credits – Subject Area Resource Guides
– Reference Requests
• Year of Study
• Catalog
• Campus Location • Other Vendor Services
– Serials Solutions
19
20. History Department - 12 months - Feb. 2008
%
BORROW CIRC/ CIRC/
PATRON STATUS BOOK CIRC MEDIA CIRC EQUIP CIRC TOTAL CIRC MEMBERS BORROWERS ING MEMBER BORROWER
UNDERGRADUATE
STUDENTS 2,715 250 698 3,663 238 186 78% 15.39 19.69
GRADUATE
STUDENTS 419 13 76 508 14 13 93% 36.29 39.08
ADJUNCT FACULTY 100 65 20 185 32 20 63% 5.78 9.25
FULL-TIME FACULTY 159 115 194 468 24 23 96% 19.50 20.35
HISTORY TOTALS 3,393 443 988 4,824 308 242 79% 15.66 19.93
LIBRARY TOTALS 23,370 8,713 20,703 52,756 7,418 4,981 67% 7.11 10.59
DEFINITIONS:
BOOK CIRCULATION = books, book disks, maps, oversize, Curriculum materials, reserve books, NJ History, Leisure Lounge
MEDIA CIRCULATION = audio & video materials, including media reserves
EQUIPMENT CIRCULATION = camcorders, overhead & data projectors, laptops, easels, DVD players, etc.
MEMBER = declared major or department member
BORROWER = any member who borrowed materials
Library Total = declared undergrad & grad majors, adjuncts & full time faculty borrowers
20
21. Communications Majors FY08/09
Communications
Statistical Categories // Item Type / Location / Call No Type / Call No Majors Freshman Sophomore Junior Senior
M- DVD / Media Services / Other / DVD 194 17 31 52 94
M- VideoCass / Media Services / Other / VC 228 11 40 67 110
T- Book / 2nd Floor - Circulating / Library of Congress / B 34 9 8 11 6
T- Book / 2nd Floor - Circulating / Library of Congress / BD 3 1 2
T- Book / 2nd Floor - Circulating / Library of Congress / BF 30 5 5 12 8
...
2nd Floor Circulating 1531 222 310 403 596
T- Juvenile / CMC / 125 14 26 20 35
T- NJDoc / Askew Documents Room / Other / 1 1
New Jersey History 10 0 2 7 1
T- ReserveBk / Reserves Desk / 189 13 46 68 62
T- SpecColl / Special Collection / Library of Congress / LC 3 3
T- Book-McNaughton / Leisure Lounge / Library of Congress / F 2 1 1
T- Book-McNaughton / Leisure Lounge / Library of Congress / HF 1 1
T- Book-McNaughton / Leisure Lounge / Library of Congress / HS 2 2
T- Book-McNaughton / Leisure Lounge / Library of Congress / HV 5 1 2 2
T- Book-McNaughton / Leisure Lounge / Library of Congress / ML 1 1
T- Book-McNaughton / Leisure Lounge / Library of Congress / PN 3 3
T- Book-McNaughton / Leisure Lounge / Library of Congress / PS 29 4 10 15
T- Book-McNaughton / Leisure Lounge / Library of Congress / RC 2 1 1
T- Book-McNaughton / Leisure Lounge / Library of Congress / TL 1 1
Leisure Lounge 49 9 1 19 20
21
22. Challenges with combining
data from various services
• Little to no linkage of data
• Multiple user IDs for authentication
22
24. What is an Application Server?
• A machine or its software that works in
conjunction with a web server to deliver
application services such as the dynamic
creation of a webpage from content stored in a
database. From http://www.webtools.ca.gov/help/Glossary.asp
• Web Server Software (Apache or IIS)
• Database Management System – DBMS (MySQL,
Oracle, MS SQL Server)
• Scripting Language (Perl, PHP, ColdFusion, ASP)
24
25. Why an Application Server?
• Relevant data in logfiles need to be in
a database to be analyze.
• Need your own DBMS to create new
tables and queries.
25
26. • Decide how you will use the Application
Server.
• Decide on the best and most plausible
configuration.
26
28. Daily and Weekly Email
Reports from the Application
Server
Circ Fines Audit Daily Report - Daily at 6:05 AM.
Dupe Patron Record Report - Daily at 5:56 AM.
Hobart Media Services Equipment Pickup Summary - Daily at 6:58 AM.
Media Service Scheduling Rooms Report - Daily at 6:02 AM.
Media Services Equipment Pickup Summary - Daily at 7:00 AM.
Received Title Alert - Daily at 6:59 AM.
Reserves Overdues - Daily at 5:59 AM.
Scheduled LIS Tasks - Daily at 6:00 AM.
ILL Borrowing Overdues Report - Weekly at 5:59 AM.
ILL Lending Reports - Weekly at 6:15 AM.
28
29. Monthly Email Reports from
the Application Server
Circ Fines Audit - Monthly at 6:10 AM.
Circulation by Location and Item Type - Monthly at 6:21 AM.
Circulation Lost and Paid - Monthly at 6:25 AM.
Circulation Online Renewal Count - Monthly at 6:30 AM.
Media Circulation - Monthly at 6:35 AM.
Reserve Circulation - Monthly at 6:40 AM.
29
32. Lending Services Reports
Lists of patrons with fines between $10 and $19.99
• Student and Alumni fines list - Sorted by either Name, Amount or Notice Date.
• PALS and Courtesy Patron fines list - Sorted by Name.
• All other Patron fines list - Sorted by Name.
Lists of patrons with fines over $19.99
• Student and Alumni fines list - Sorted by either Name, IID, Amount, Notice Date or
Notes.
• PALS and Courtesy Patron fines list - Sorted by Name.
• VALE Patron fines list - Sorted by Name.
• All other Patron fines list - Sorted by Name.
Lists of patrons with overdues older than 30 days
• Student and Alumni overdues list - Sorted by either Name, IID or Notes.
• PALS and Courtesy Patron overdues list - Sorted by Name.
• All other Patron overdues list except VALE - Sorted by Name.
32
33. Lending Services Reports, cont.
Lists of VALE patrons with overdues older than 6 months
• VALE patron overdues list - Sorted by Name.
Miscellaneous Reports
• Patrons with the word "Collection Agency" or "CA" in their notes.
• Patrons with the word "FINE" in one of their notes.
• Patrons with the word "SOILS" in their notes.
• Patrons with the word "FALL07 SOILS" in their notes.
• Patrons with the word "HOLD" in their notes.
• Combined list of HOLD, FINE, and CA.
Circulation Reports by Item Type from 2003 to the present
• All Staff.
• All Colleges
• Undergraduates by Major.
• Graduates by Major
• Patrons that have reached a total fine balance of $10 or more after 31-Dec-2009
and 30-Nov-2009 33
34. One of Our Projects
• Mining EZProxy logfiles and linking to patron
statistical categories from the Voyager Patron
Database
– What majors and departments are accessing
which database services?
– What majors and departments are accessing
the ILL services?
34
35. ILL request form authentications by major
Article Book
Count Major Count Major
62 M- Psychology 90 M- History
60 M- Sociology 28 M- Non-Degree
42 M- Applied Clinical Psych 25 M- Pub Pol & Intl Affairs
35 M- Education 20 M- Spanish
31 M- History 18 M- English
30 M- Spanish 16 M- Undecided
29 M- Nursing 14 M- Art
M- Communication 14 M- Education
19 Disorders 11 M- Sociology
19 M- Communication 10 M- Biology
14 M- Biotechnology 9 M- Music
14 M- Counseling 9 M- Special Programs
14 M- English 8 M- Psychology
12 M- Non-Degree 7 M- Biotechnology
10 M- Community/Sch Health 7 M- Political Science
7 M- Biology 6 M- Anthropology
7 M- Political Science 6 M- Music - Jazz Studies
6 M- Undecided 4 M- Business
5 M- Comm Media Studies 4 M- Communication
5 M- Reading 4 M- Nursing
4 M- Business
35
44. • Active Circ transactions are stored in a
table with patron ID and statistical
categories.
• Completed Circ transactions are stored
in a table without the patron ID, but still
with the patron statistical categories.
• The Patron Table contains the total
counts of transactions for each
patron, but no link to which transactions
they are.
44
45. • EZProxy transactions would be stored in
one table with patron statistical
categories, but without the user
ID.
• User ID s would be stored in another
table with counts for each service divided
by academic
year.
• Logs are collected monthly and loaded
and deleted monthly.
45
46. Example of EZProxy log entry
• Ip address nj.dhcp.embarqhsd.net
• (Not used) -
• user id theuser
• date/time 1/1/2008 4:25:15 AM
• Method GET
• page http://ezproxy.wpunj.edu:2048/connect?session=sGHMbeSss121YxZ
a&url=http://www.wpunj.edu/scripts/webscript.exe?fs.scr
retrieved
HTTP/1.1
• Version
302
• response
code
• no. of bytes 537
• Referring http://ezproxy.wpunj.edu:2048/login?url=http://www.wpunj.edu/scripts/
URL webscript.exe?fs.scr
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR
• User agent 1.1.4322)
46
47. Perl Script for loading ezproxy
log into MySQL
use strict;
my
%month=(Jan=>'01',Feb=>'02',Mar=>'03',Apr=>'04',May=>'05',Jun=>'06',Jul=>'07',
Aug=>'08',Sep=>'09',Oct=>'10',Nov=>'11',Dec=>'12');
while (<>){
my $pattern =
'^(S*) (S*) (S*) (S*) '.
'[(..)/(...)/(....):(..):(..):(..) .....]'.
' "(S*) (S*) (S*)" '.
'(d*) (-|d*) "([^"]*)" "([^"]*)"';
if (m/$pattern/){
my ($tgt,$ref,$agt) = (esc($12),esc($16),esc($17));
my $byt = $15 eq '_'?'NULL':$15;
print "INSERT INTO ezproxylogs VALUES ('$1','$2','$3',".
" TIMESTAMP '$7/$month{$6}/$5 $8:$9:$10','$11','$tgt',".
"'$13',$14,$byt,'$ref','$agt');r.";
}else{
print "--Skipped line $.n";
}
}
sub esc{
my ($p) = @_;
$p =~ s/'/''/g;
return $p; 47
}
48. Created table to assist the
linking
SELECT PATRON_ADDRESS.ADDRESS_TYPE,
Left([ADDRESS_LINE1],InStr([ADDRESS_LINE1],"@"
)-1) AS usr,
PATRON_ADDRESS.PATRON_ID,
PATRON_ADDRESS.ADDRESS_STATUS,
PATRON_ADDRESS.EFFECT_DATE,
PATRON_ADDRESS.EXPIRE_DATE,
PATRON_ADDRESS.MODIFY_DATE,
PATRON_ADDRESS.MODIFY_OPERATOR_ID INTO
emailprefix
FROM PATRON_ADDRESS
WHERE
(((PATRON_ADDRESS.ADDRESS_TYPE)="3"));
48
49. Reporting and Standards
• Reporting
– Emailed periodically - e.g., daily dossiers,
and other event triggered reports.
– On demand, via email, web pages or a
printer.
• Standards
– Share data for comparative research.
– Groups of libraries and consortia
49