Pre-requisites:
To accomplish this I've used standard Unix/POSIX tools, so you'll need Cygwin if you're doing this on Windows (everybody who runs Windows has Cygwin installed anyway, right? :)).
wget: Make sure you are using a recent version of wget, as we use quite a few command-line switches; I'm using v1.13.4 (a quick version check is sketched after this list).
xmllint: Used for executing XPath queries against the returned XML DOM of our activity feed.
tr, awk: For standard stream processing, etc.
Atlassian JIRA: My tests are against v6.0 of the "On Demand" version of the product (in other words, hosted by Atlassian); I'm hoping/guessing this will also work for a locally managed "Download" version. You must have the "activity stream" gadget installed and accessible on the user profile against which you are performing this.
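If you want to sanity-check the toolchain before starting, a quick (and entirely optional) version check looks like this - note that xmllint prints its version to stderr:

wget --version | head -n 1         # e.g. "GNU Wget 1.13.4 built on linux-gnu."
xmllint --version 2>&1 | head -n 1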
Commands:
Make a directory to store the output:
mkdir jiraSuck

Log in to the Atlassian JIRA website by providing your username and password via POST data, saving the cookies so we can maintain the session (obviously replace my username with your required username).
(Please note that in my example the JIRA server I am querying is the "On Demand" type, which means it is hosted by Atlassian as a subdomain of atlassian.net):
wget --keep-session-cookies --max-redirect 0 --no-check-certificate --save-cookies cookies.txt --post-data 'username=sgillibrand&password=YOURPASSWORD' https://JIRASUBDOMAIN.atlassian.net/login
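Before moving on it's worth a quick check that the login actually succeeded - the cookie jar should now contain at least one entry (cookie names vary between JIRA versions, so I'm simply counting non-comment lines rather than grepping for a specific name):

grep -v '^#' cookies.txt | grep -c .    # 0 means the login almost certainly failed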
Get an XML stream of ALL your activity by asking for a large maxResults figure (999999):

wget --no-check-certificate --load-cookies cookies.txt -O jiraActivity.xml "https://JIRASUBDOMAIN.atlassian.net/activity?maxResults=999999&streams=user+IS+sgillibrand&os_authType=basic&title=undefined"
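Before running the next pipeline it's worth eyeballing the start of the feed to confirm you got XML back rather than a login page:

xmllint --format jiraActivity.xml | head -n 20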
We accomplish quite a few things with this next line. Note that, as the returned XML representation of our activity is littered with numerous namespaces, I'm using the local-name() function of XPath so we can operate in a namespace-agnostic way. First we extract the HREF/URL for each JIRA issue mentioned in each activity entry, then we place each URL on a separate line, delete the 'href=' prefix and remove all double quotes. Next we remove all duplicate URLs, and finally we derive an alternate URL from each one which gives us a printable version of the JIRA issue (this is handy as it contains all field data expanded) - all the output is redirected to the file jiraUrls.txt:

xmllint --xpath "//*[local-name()='entry']/*[local-name()='target']/*[local-name()='link']/@href" jiraActivity.xml | tr " " \\n | awk 'sub(/href=/, "")' | awk 'gsub(/"/, "")' | awk '!x[$0]++' | awk -F "/" '{printf "%s\n%s/%s/%s/si/jira.issueviews:issue-html/%s/%s.html\n",$0,$1,$2,$3,$5,$5}' >jiraUrls.txt
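Two parts of that pipeline deserve a quick illustration. The awk '!x[$0]++' idiom prints only the first occurrence of each line (the array lookup is false the first time through, and the post-increment makes it true thereafter), while the final awk splits each URL on '/' and rebuilds it as the printable variant. Using a made-up issue key PROJ-123 as an example:

printf 'a\nb\na\n' | awk '!x[$0]++'    # prints "a" then "b" - the duplicate is dropped

echo 'https://JIRASUBDOMAIN.atlassian.net/browse/PROJ-123' | awk -F "/" '{printf "%s\n%s/%s/%s/si/jira.issueviews:issue-html/%s/%s.html\n",$0,$1,$2,$3,$5,$5}'
# https://JIRASUBDOMAIN.atlassian.net/browse/PROJ-123
# https://JIRASUBDOMAIN.atlassian.net/si/jira.issueviews:issue-html/PROJ-123/PROJ-123.html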
Change into our newly created directory:

cd jiraSuck

Now scrape all the JIRA issue standard HTML, printable HTML and specified attachments - this can take some time!
(change the acceptable file extensions and domains to suit your needs):
wget --no-check-certificate -nc -r -k -p -l 1 -E --accept=.jpg,.png,.zip,.7z,.rar,.html,.htm,.xls,.ppt,.xlsx,.doc,.docx,.pptx --restrict-file-names=windows --domains=atlassian.net --load-cookies ../cookies.txt -i ../jiraUrls.txt
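Since the command uses -nc it is safe to simply re-run it if your connection drops. If you'd like a record of the run, wget's -o switch writes its messages to a log file instead of the terminal - add -o ../scrape.log to the command above, then afterwards do a rough scan for failures (the exact message text varies, so treat this as a first-pass filter):

grep 'ERROR' ../scrape.log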
Some time later..............
FINISHED --2013-07-16 16:30:46--
Total wall clock time: 30m 49s
Downloaded: 856 files, 373M in 25m 6s (253 KB/s)
Explanation of the resultant directory structure:
The browse directory contains the normal HTML view of each JIRA issue.
The si directory contains the printable view of each JIRA issue.
The secure directory contains any attachments associated with each JIRA issue.
All pertinent links have been converted to point relatively to your local directory structure.
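A couple of quick checks on the result never hurt (run from inside jiraSuck; adjust paths if your wget created an extra host-named directory level):

find . -type f | wc -l    # total number of files scraped
du -sh .                  # total size on disk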
Job done :)