DecideRuleSequence
A list of DecideRules:
• Processed in order
• *Every rule is processed
• Each rule's result will be one of:
  • ACCEPT: URI is ruled in scope
  • REJECT: URI is ruled out of scope
  • PASS: DecideRule has no effect
*DecideRules can have an onlyDecision method to skip processing if they can’t change the outcome.
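As a rough illustration of the ordering, a minimal sequence might be declared as below. This is a sketch, not the crawl's actual configuration: the bean id scope, the choice of three rules and the maxHops value are assumptions for the example, with class and property names as they appear in Heritrix 3's default crawler-beans.cxml. In a sequence, the last rule to return something other than PASS determines the outcome, so later rules override earlier ones.

```xml
<!-- Illustrative only: a three-rule sequence, evaluated top to bottom. -->
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- 1. Always returns REJECT, so nothing is in scope unless a later rule says otherwise. -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule"/>
      <!-- 2. Returns ACCEPT for URIs under a known SURT prefix, PASS for everything else. -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <property name="seedsAsSurtPrefixes" value="true"/>
      </bean>
      <!-- 3. Returns REJECT for URIs too many hops from a seed, PASS otherwise. -->
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <property name="maxHops" value="20"/>
      </bean>
    </list>
  </property>
</bean>
```

With this ordering, a URI under a seed's SURT prefix but 25 hops out is REJECTed by rule 1, ACCEPTed by rule 2 and REJECTed again by rule 3, so the final decision is REJECT.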
UK Domain Crawl 2014
RejectDecideRule                 REJECT everything by default
SurtPrefixedDecideRule           ACCEPT .uk, london, etc.
MatchesRegexDecideRule           ACCEPT; try to capture media files
HopsPathMatchesRegexDecideRule   ACCEPT anything embedded on a seed; ^E*$
*ExternalGeoLocationDecideRule   ACCEPT IP addresses in GB
*OnDomainsDecideRule             ACCEPT specific domains; disabled by default
*HopsPathMatchesRegexDecideRule  ACCEPT redirects from seeds; ^R+$
CompressibilityDecideRule        REJECT highly (in)compressible URIs; experimental
*TooManyHopsDecideRule           REJECT URIs more than 20 hops from a seed
*MatchesListRegexDecideRule      REJECT specific patterns
PathologicalPathDecideRule       REJECT URIs with more than 3 recurrences of a pattern
TooManyPathSegmentsDecideRule    REJECT URIs with more than 15 path segments
SurtPrefixedDecideRule           ACCEPT URIs matching a list of URL-shortening services
SurtPrefixedDecideRule           REJECT a list of SURTs from a file (exclude.txt)
PrerequisiteAcceptDecideRule     ACCEPT prerequisites
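The two path-shape limits in the list above might be declared along these lines. This is a hedged sketch: the property names maxRepetitions and maxPathDepth are those used in Heritrix 3's default crawler-beans.cxml, but mapping the figures of 3 and 15 directly onto them is an assumption, and the exact boundary behaviour should be checked against the Heritrix version in use.

```xml
<!-- Fragments for the "rules" list of a DecideRuleSequence; both rules REJECT by default. -->
<bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
  <!-- Assumed mapping of "more than 3 recurrences of a pattern". -->
  <property name="maxRepetitions" value="3"/>
</bean>
<bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
  <!-- Assumed mapping of "more than 15 path segments". -->
  <property name="maxPathDepth" value="15"/>
</bean>
```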
UK Domain Crawl 2014
Basic flow:
• REJECT everything.
• Look for reasons to ACCEPT content.
• REJECT anything you absolutely do not want.
Those marked with a ‘*’ are specified using Spring’s <ref bean…/> syntax. This
facilitates changing their values dynamically or with Sheets.
Those struck out are disabled by default (and typically enabled using Sheets for
specific sites).
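The two mechanisms described above could look roughly like this. The sketch assumes a rule bean id of onDomainsDecideRule, a Sheet named enableOnDomains and an example SURT prefix, all hypothetical. The Sheet and SurtPrefixesSheetAssociation classes follow the commented-out examples in Heritrix 3's default crawler-beans.cxml; using the rule's enabled property as an overlay key is an assumption to verify against the version in use.

```xml
<!-- Top-level rule bean with an id, so it can be referenced and overlaid. -->
<bean id="onDomainsDecideRule" class="org.archive.modules.deciderules.surt.OnDomainsDecideRule">
  <property name="decision" value="ACCEPT"/>
  <!-- Disabled globally; a Sheet switches it on for chosen sites. -->
  <property name="enabled" value="false"/>
</bean>

<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <bean class="org.archive.modules.deciderules.RejectDecideRule"/>
      <!-- Reference to the top-level bean instead of an inline definition. -->
      <ref bean="onDomainsDecideRule"/>
    </list>
  </property>
</bean>

<!-- Sheet overriding one property of the referenced bean; the key is beanId.property. -->
<bean id="enableOnDomains" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="onDomainsDecideRule.enabled" value="true"/>
    </map>
  </property>
</bean>

<!-- Apply the Sheet to URIs under a (hypothetical) SURT prefix. -->
<bean class="org.archive.crawler.spring.SurtPrefixesSheetAssociation">
  <property name="surtPrefixes">
    <list>
      <value>http://(uk,co,example,</value>
    </list>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>enableOnDomains</value>
    </list>
  </property>
</bean>
```

Because the rule has its own id, the Sheet can address it by name. A rule defined inline inside the list has no such handle.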
Experimental DecideRules:
• ExternalGeoLocationDecideRule: found 2,544,426 new hosts.
• CompressibilityDecideRule: REJECTed 1,650,861 URIs.
Beyond Scoping
All Processors have a shouldProcessRule property, so you can use DecideRules
instead of simple true/false values.
We use this to filter viral (virus-flagged) content so that the WARC writer does not store it:
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
  <property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.DecideRuleSequence">
      <property name="rules">
        <list>
          <!-- ACCEPT everything by default... -->
          <bean class="org.archive.modules.deciderules.AcceptDecideRule"/>
          <!-- ...then REJECT any URI whose annotations match the pattern -->
          <bean class="uk.bl.wap.modules.deciderules.AnnotationMatchesListRegexDecideRule">
            <property name="decision" value="REJECT"/>
            <property name="regexList">
              <list>
                <value>^.*stream:.+FOUND.*$</value>
              </list>
            </property>
          </bean>
        </list>
      </property>
    </bean>
  </property>
</bean>
logging.properties
In the logging.properties file:
org.archive.modules.deciderules.DecideRuleSequence.level=FINEST
This will output the decision of every DecideRule for every URI.
This will generate a lot of log entries.
Only practical for small crawls or testing.
scope.log
Alternatively, a recent addition to the DecideRuleSequence:
<property name="logToFile" value="true" />
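This will create a file, scope.log, containing the final decision for every
URI along with the specific rule which made that decision. Each line gives a
timestamp, the position of the deciding rule in the sequence (counting from 0),
the rule's name, the decision and the URI: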
2014-11-05T10:17:39.790Z 4 ExternalGeoLocationDecideRule ACCEPT http://www.jaymoy.com/
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT https://t.co/Sz15mxnvtQ
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT http://twitter.com/2017Hull
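For context, logToFile is set on the DecideRuleSequence bean itself rather than on any individual rule. A minimal sketch, assuming the sequence is the crawl's scope bean with the id scope (an assumption for the example) and with the rule list elided:

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <!-- Write the deciding rule and final decision for every URI to scope.log. -->
  <property name="logToFile" value="true"/>
  <property name="rules">
    <list>
      <!-- DecideRules go here, in evaluation order. -->
    </list>
  </property>
</bean>
```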