Arc Forumnew | comments | leaders | submitlogin
4 points by cchooper 5582 days ago | link | parent

This is my first attempt:

  (= *nwords* (counts:tokens (downcase:w/infile f "big.txt" (tostring (drain:pr:readline f))) whitec))

  (def edits1 (word)
    (with (alphabet "abcdefghijklmnopqrstuvwxyz"
           n (len word))
      (dedup:accum add
        (forlen i word (add:+ (cut word 0 i) (cut word (+ 1 i))))
        (for i 0 (- n 2) (add:string (cut word 0 i) (word (+ 1 i)) (word i) (cut word (+ 2 i))))
        (forlen i word (each c alphabet (add:string (cut word 0 i) c (cut word (+ 1 i)))))
        (for i 0 n (each c alphabet (add:string (cut word 0 i) c (cut word i)))))))

  (def known-edits2 (word) (flat:map known:edits1 (edits1 word)))

  (def known (words) (keep [*nwords* _] words))

  (def correct (word)
    (let candidates (or (known:list word) (known:edits1 word) (known-edits2 word))
      (best (compare > [*nwords* _]) candidates)))
This achieves nothing close to 10 words per second in the worst case scenario, more like one word per minute.

2 points by lojic 5582 days ago | link

Thanks! I'd say that definitely compares favorably with the Clojure version aesthetically.

The performance is a little disappointing though :( Here's the performance of my Ruby (i.e. slowest of all languages!) version:

  ~/sync/code/ruby$ irb
  irb(main):001:0> load 'spelling_corrector.rb'
  => true
  irb(main):002:0> def time_sc n
  irb(main):003:1> t1 =
  irb(main):004:1> n.times { correct "zaqxsw" }
  irb(main):005:1> puts - t1
  irb(main):006:1> end
  => nil
  irb(main):007:0> time_sc 10
So that's 2.6s per word.

The Clojure version is considerably faster (0.33s per word):

  user=> (time (dotimes i 10 (correct "zaqxsw" *nwords*)))
  "Elapsed time: 3254.863 msecs"
For words for which a correction was found, the Clojure version processed 900/s, Ruby 100/s


1 point by cchooper 5582 days ago | link

Oddly, the Ruby version runs more slowly every time I run it. I think memory locality may be the problem, as the Ruby process grows each time (now at 85 MB). That's nothing compared to my MzScheme process which is now consuming 700MB. That's probably due to the inefficient way in which nwords is constructed. A bit of tuning there might work wonders for overall performance.


1 point by babo 5578 days ago | link

Please post python's speed as a reference, it's wicked fast as far as I remember.


1 point by cchooper 5582 days ago | link

This change reduces the running time by about a third:

  (def known-edits2 (word) 
    (accum add (each e (edits1 word) (map add (keep [*nwords* _] (edits1 word))))))


4 points by rkts 5581 days ago | link

Probably because it's incorrect. You should have (edits1 e) at the end there.


2 points by cchooper 5580 days ago | link

True. The correct version is still faster though.