Solr Extracting Data
1. Solr Extracting Data
● Start this session with a full Solr indexed repository
– Movie cAiYBD4BQeE showed installation
– Movie Th5Scvlyt-E showed Nutch web crawl
● This movie will show how to
– Extract data from Solr
– Extract to XML or CSV
– Show the aim: load the extracted data into a data warehouse
● This movie assumes you know Linux
3. Checking Solr Data
● Data should have been indexed in Solr
● In Solr Admin window
– Set 'Core Selector' = collection1
– Click 'Query'
– In Query window set fl field = url
– Click Execute Query
● The result (next) shows the filtered list of URLs in Solr
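The same check can be run without the admin window; a minimal sketch of the equivalent select URL, assuming Solr runs at its default localhost:8983 with the core collection1 selected above:

```shell
#!/bin/bash
# Equivalent of the admin-window query: all documents, url field only.
CORE="collection1"   # the core chosen in 'Core Selector'
CHECK_URL="http://localhost:8983/solr/${CORE}/select?q=*:*&fl=url"
echo "${CHECK_URL}"
# With Solr running, the same filtered list comes back via:
#   curl "${CHECK_URL}"
```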
5. How To Extract
● How could we get at the Solr data?
– In the admin console via a query
– Via an HTTP Solr select call
– Via a curl -o call using the Solr HTTP select
● What data format suits this purpose?
– XML
– Comma-separated values (CSV)
6. How To Extract
● We want to extract two columns from Solr
– tstamp, url
● We want to extract as CSV (the csv in the call below could be xml)
● We want to extract to a file
● So we will use an http call
– http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv
● We will also use a curl call
– curl -o <csv file> '<http call>'
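The pieces above can be assembled step by step; a minimal sketch that builds the select URL from its parts so each parameter is visible (Solr assumed on the default localhost:8983; the curl line is shown as a comment since it needs a running Solr):

```shell
#!/bin/bash
# Build the Solr select URL parameter by parameter.
SOLR_HOST="localhost:8983"   # assumed default Solr host and port
QUERY="*:*"                  # q: match every document
FIELDS="tstamp,url"          # fl: the two columns we want
FORMAT="csv"                 # wt: could be xml instead
URL="http://${SOLR_HOST}/solr/select?q=${QUERY}&fl=${FIELDS}&wt=${FORMAT}"
echo "${URL}"
# With Solr running, the extract itself would be:
#   curl -o result.csv "${URL}"
```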
7. How To Extract
● Create a bash file in the Solr install directory
– cd solr-4-2-1/extract ; touch solr_url_extract.bash
– chmod 755 solr_url_extract.bash
● Add contents to bash file
– #!/bin/bash
– curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'
– mv result.csv result.csv.$(date +"%Y%m%d.%H%M%S")
● Now run the bash script
– ./solr_url_extract.bash
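The slide's manual steps can be scripted end to end; a sketch that writes the extract script with a heredoc and makes it executable (run from your extract directory; the slide uses solr-4-2-1/extract, and Solr is assumed local):

```shell
#!/bin/bash
# Write the extract script into the current directory.
cat > solr_url_extract.bash <<'EOF'
#!/bin/bash
# Pull tstamp and url for every document as CSV, then timestamp the
# output file so repeated runs do not overwrite earlier extracts.
curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'
mv result.csv result.csv.$(date +"%Y%m%d.%H%M%S")
EOF
chmod 755 solr_url_extract.bash
```

The quoted heredoc delimiter ('EOF') keeps $(date ...) from expanding while the file is written, so it runs at extract time instead.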
8. Check Output
● Now we check whether we have data
● ls -l shows
– result.csv.20130506.124857
● Check the content, wc -l shows 11 lines
● Check the content, head -2 shows
– tstamp,url
– 2013-05-04T01:56:58.157Z,http://www.mysite.co.nz/Search?DateRange=7&...
● Congratulations, you have extracted data from Solr
● It's in CSV format ready to be loaded into a data warehouse
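The checks above can be reproduced on any extract file; a sketch that fakes a small extract so wc -l and head can run without a live Solr (the sample rows are invented for illustration):

```shell
#!/bin/bash
# Create a made-up extract: header line plus two data rows.
cat > result.csv.sample <<'EOF'
tstamp,url
2013-05-04T01:56:58.157Z,http://www.example.com/page1
2013-05-04T01:57:02.312Z,http://www.example.com/page2
EOF
wc -l < result.csv.sample    # number of lines: header + data rows
head -2 result.csv.sample    # header plus the first data row
```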
9. Possible Next Steps
● Choose more fields to extract from data
● Allow Nutch crawl to go deeper
● Allow Nutch crawl to collect a lot more data
● Look at facets in Solr data
● Load CSV files into Data Warehouse Staging schema
● Next movie will show next step in progress
10. Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours you need to solve them