You may or may not have been following the Google Treasure Hunt competition, a puzzle contest designed to test your knowledge of computer science, networking, and low-level UNIX trivia (as described on the Google blog). It's also a way for them to find potential engineers to be assimilated... er, hired. I took one of the questions for a spin today and thought I'd post the method I used to solve it. It probably wasn't the fastest way, but it worked for me; if anyone has suggestions for improvements, let me know! Here's the puzzle:
The site gives you a uniquely generated zip archive, full of directories and subdirectories and randomly named files, for you to download and extract. Their instructions from there (also generated uniquely for me, but with the same basic challenge each time):
Unzip the archive, then process the resulting files to obtain a numeric result. You'll be taking the sum of lines from files matching a certain description, and multiplying those sums together to obtain a final result. Note that files have many different extensions, like '.pdf' and '.js', but all are plain text files containing a small number of lines of text.
Sum of line 4 for all files with path or name containing bar and ending in .xml
Sum of line 2 for all files with path or name containing bar and ending in .txt
Hint: If the requested line does not exist, do not increment the sum.
Multiply all the above sums together and enter the product below.
(Note: Answer must be an exact, decimal representation of the number.)
And my solution, starting from a Unix prompt in the directory where the files were unpacked to:
# find . -ipath "*bar*.xml" -print | xargs grep -h -n '.*' | egrep '^4:'| cut -d':' -f2
# find . -ipath "*bar*.txt" -print | xargs grep -h -n '.*' | egrep '^2:'| cut -d':' -f2
I then took the two lists of numbers, pasted them into a spreadsheet, and multiplied the two sums into the final answer. I started to look at Unix tools to sum a list of numbers passed as arguments, but unsure if Google was timing me, I opted for the spreadsheet instead to keep it fast.
I could have also used "cat -n" to generate the line-number-prefixed output, but for some reason grep was on the brain.
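For the summing step, the spreadsheet could probably be swapped for a few lines of Python. This is only a rough sketch, not what I actually ran: it assumes the output of each pipeline has been captured as text with one number per line, the function name sum_column is made up for illustration, and the sample numbers are invented.

```python
def sum_column(text):
    """Sum the integers in `text`, one per line."""
    total = 0
    for line in text.splitlines():
        try:
            total += int(line.strip())
        except ValueError:
            pass  # blank or non-numeric line: leave the sum unchanged
    return total


# Invented sample data standing in for the output of the two pipelines.
xml_numbers = "12\n7\n\n30\n"
txt_numbers = "5\n5\n"

# Python integers have arbitrary precision, so the product stays exact.
print(sum_column(xml_numbers) * sum_column(txt_numbers))
```

Since Python integers never overflow or round, the printed product is an exact decimal number, which is what the puzzle asks for.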
How would you do it?
I like this puzzle as a potential test for a network/Unix sysadmin, and plan to use it at some point (especially since we're trying to hire a system administrator at Summersault). Maybe Google will release their puzzle generation code?
3 thoughts on “Solution for Google Treasure Hunt "zipfile" question”
"How would you do it?"
Um. I guess I wouldn't. Does that mean you're not going to hire me?
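import glob
# recursive=True lets ** descend into subdirectories (Python 3.5+)
fns = [fn for fn in glob.glob('**/*.xml', recursive=True) if 'bar' in fn]
fn_lines = [open(fn).readlines() for fn in fns]
# line 4 is index 3; files with fewer than 4 lines are skipped
xmls = sum(int(lines[3].strip()) for lines in fn_lines if len(lines) >= 4)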
Then I'd copy and paste for the second one if time was of the essence, as it'd be hard to do it faster via functions. Otherwise obviously the glob pattern and line number would be parameters to a function.
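Roughly, the function version might look like the sketch below. It swaps glob for os.walk plus fnmatch so the pattern is matched against the whole relative path, much like find's -ipath but case-sensitive on Unix. The name sum_of_line and its parameters are hypothetical, and it assumes every requested line that exists holds a plain integer.

```python
import fnmatch
import os


def sum_of_line(root, pattern, line_number):
    """Sum line `line_number` (1-based) of every file under `root`
    whose path relative to `root` matches the shell-style `pattern`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            # fnmatch's * also matches path separators, so '*bar*.xml'
            # catches "bar" in a directory name as well as a file name.
            if not fnmatch.fnmatch(rel, pattern):
                continue
            with open(full) as f:
                lines = f.read().splitlines()
            # Per the hint: a missing line does not change the sum.
            if len(lines) >= line_number:
                total += int(lines[line_number - 1].strip())
    return total
```

Run from the directory where the archive was unpacked, sum_of_line('.', '*bar*.xml', 4) * sum_of_line('.', '*bar*.txt', 2) would then give the product for the puzzle as quoted in the post.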
I'd do it this way because my command-line-fu is weak, but my programming-fu is strong; the command-line technique is also quite clever. Though doing calculations at the command-line always seemed unnecessarily hard. Honestly globbing might not have occurred to me at first, leading to a much more complicated file traversal.
A spreadsheet?! Oy.
(Sorry, couldn't resist.)