art taylor

 
Filed under

Visualization

 

Adventures in Haskell optimization

Phil-1
After getting some grief about the tools used in my last foray into graphing philosophical heritage[1] I decided to re-implement the influence graph in Haskell.  How hard could it be?  I've written a lot of "scripts" with Haskell, along with about half of my Project Euler problems, although I haven't touched those in a couple years.

OMG.  Seven lines of admittedly contortionist Mathematica turned into about 170 lines of contortionist Haskell, but at this point I could cut it in half (at least) by writing two monads.  Compared to Mathematica, Haskell is a bit slower for the same operations, but the "implicit" threading enabled by the type system helps even the score, snarfing up to six cores at once.

I did a simple pull of a network of philosophers (no .dot generation at that point, just displaying it with show) and Haskell's laziness hit me harder than it ever has before.  I had a huge number of thunks carrying around copies of large Maps and Wikipedia pages, and it was leaking memory like a thing that has a lot of holes in it and cannot hold water, but instead of water, it holds memory.  But not very well at all.  

Because ghc still can't build 64-bit executables on Snow Leopard, I ran up to about 2-3GB and then the program would barf.  I tried to find a way to get ghc to emit x86_64 code on 10.6, but I was stymied.  I had to make do in the cramped quarters of 32-bit addressing.  This was one area where Mathematica spanked it, and while 32- vs 64-bitness doesn't completely justify the cost difference between the "home" and "professional" versions of Mathematica 7, it does make me happy to have my aging copy of Mathematica 6.  Not gc'ing 97% of the time makes a big difference.

Because of ghc's memoization of let bindings and thunks, It became faster for me to retrieve a Wikipedia page for every piece of information I needed rather than to bind it to a variable (which I then persisted in a local file for subsequent requests).  This is because my hackish approach was to grab the name, "influences", and "influenced by" chunks from the page, and then recur up and down to grab the same information for the afferent and efferent influence lists until I hit bottom.  I recurred with proper tail calls, but I'm not sure if ghc eliminates them -- I haven't looked at the core output yet.  I wasn't about to code an iterative version.

OK, that's exactly what I ended up doing, but I used map and lazy node evaluation, so it doesn't really count.  Map is still functional!

Some minor issues with UTF-8 string formatting (thank you Codec.Binary.UTF8.String and decodeString) later, I could generate a monochrome version of the graph that I was able to export from Mathematica.  It doesn't import well into OmniGraffle Pro because OGP doesn't support penwidth for edges in dotfiles (*sigh*), but the layout is pretty fast.  I also don't have the REPL ability to get shortest paths as I could with Mathematica, but I suspect I could have done it if I could stomach using ghci.  Then again, calling ghci a REPL is stretching the definition, at least on the Mac.

When it came to emitting the dotfile, I ended up being lazy about scaling nodes based on relative influence.  I had planned an exponential moving value based on the upstream and downstream influence, so someone with no influences but who influenced a lot of people, who influenced a lot of people. who influenced a lot of people, until reaching bottom. (see where those thunks came from?) would be really large, requiring logarithmic scaling.  I wimped out at midnight due to some cycles in the directed graph, and just did a simple ratio of immediate influences and influenzas with a telomeric recursion.  If I dig this up on another night, I might finish it up, but it's already exceeding my attention span.  I figure I'll put it aside and stumble upon it in six months and wonder what I was smoking.

My last step was to apply coloring to the graph to help trace the lines.  I did a simple CLUT based on the "net clout" mentioned above.

Anyway, enjoy the ugly picture.  Mathematica was in no way harmed in the production.

Click here to download:
phil-1.pdf (1.23 MB)
(download)

NB: To be fair, I am the worst Haskell programmer in the world.  At least, the worst Haskell programmer who can get anything to compile.  The nice thing about Haskell is that if it compiles, it will come close to doing what the programmer intended, delivering on the promise of "if it builds, ship it!"  That doesn't mean it's right for me to perpetrate disasters like this:

producePhilosopherDescriptions :: Philosopher -> [String]
producePhilosopherDescriptions phil = map (\k -> "\"" ++ k ++ "\" [label = \"" ++ k ++ "\" height=" ++ (show $ philosopherInfluenceScore phil 1 k) ++ " width=" ++ (show $ philosopherInfluenceScore phil 1 k) ++  " fontsize=" ++ (show $ 10 * (philosopherInfluenceScore phil 1 k)) ++ "];") $ Map.keys $ nameToURLMap phil

[1] "Who has a copy of Mathematica lying around‽"

Filed under  //   Graphviz   Visualization   haskell  

Comments [2]

Fun with Mathematica, Wikipedia, and Graphviz

Click here to download:
philosophersl.pdf (1.89 MB)
(download)

One of the fine gentlemen at work showed me a project his son is working on in school. InPhO (http://inpho.cogs.indiana.edu/) analyses the Stanford Encyclopedia of Philosophy to build an ontology. There is some human input, and some of the areas are not fleshed out, but it's pretty groovy and they're off to a great start.

Since reading up on philosophers and their influences struck a chord with me, I cursed my colleague that he had condemned me to wasting the rest of the day clicking around his son's website. I hit one of my favorites, Baruch Spinoza, and noticed that the set of influences and those influenced wasn't heavily populated. Since writing stupid programs and throwing them into smarter programs is something along the lines of a way of life for me, I hacked together a little program to crawl Wikipedia from a known starting point, and crawl afferent and efferent influence declarations until I reached leaf or ingenerate nodes. One particularly vexing issue with Wikipedia is inconsistent naming for the same person (Hegel was a particularly painful example) and I leaned on a little help from my old friend Levenshtein to combine some names, but some duplicates made it through. In addition, some concepts appeared to be first-class citizens, and while I was able to flag some, others snuck through.

Lists_of_marxists

With the data in hand, I was able to ask Important Questions, such as,"What is the chain of influences connecting Aristotle to Tom Waits?"

Tom_waits

How about Plato and Pynchon?

Thomas_pynchon

I bet you can't connect Ovid and Chomsky!

Chomsky

OK, that's fun enough, but I wanted pictures! I generated a plot in Mathematica, but it was the least fun plot ever. Even in 3D.

Mma_plot

I emitted a dot file for Graphviz, and it promptly created the second-to-least fun graph ever. Growing tired of the exercise, I imported the dot file into OmniGraffle Pro and crashed it about a hundred dozen times, making incremental tweaks to the top ten influencers (i.e., the huge vertices) and the layout to make the behemoth image that may or may not be above, depending on whether Posterous wanted any part of this business. If you have a plotter handy, I highly recommend seeing if it will become sentient upon putting this mess to paper.

I poked around the plotted graph to see if it was worth the trouble to which I had gone, and found some amusing little connections, giving a lens into the minds of Wikipedia editors and philosophers.

Jesus

I think perhaps a LITTLE more influence is called for there. Then again, it appears founders of religions are under-appreciated.

Muhammad

I'm not going to comment on the fact that there is no line connecting the upper oval to the lower one, other than to say that it's obvious Wikipedia is a wretched hive of scum and infidels.

One thing that surprised me, that I alluded to above, is how many non-philosophers were drawn into the graph. A lot of authors, and even some directors (The Wachowski brothers and J. J. Abrams, for example) made appearances.

Philip_k

And to top it all off, even Jimmy Buffett snuck in there, Hal Holbrook and Michael Crichton in tow. Fortunately, it appears they have influenced even fewer people than Jesus or Muhammad.

Jimmy_buffett

If it weren't already 3:20am, I'd probably try to refine the data collection a little more. However, two hours spent on this is enough. Most of it was spent tweaking in OmniGraffle, but I couldn't sleep and thought I might as well do something useless.

Filed under  //   Graphviz   Mathematica   Philosophy   Visualization   Wikipedia  

Comments [0]