asmodean's reverse engineering page


2013/01/17 / Frequency Based Study Decks

This is a Japanese vocabulary study deck for Anki, optimized for words that appear in Tales of Xillia 2. Anki implements a smart Spaced Repetition approach that tracks performance over multiple days and paces introduction of new cards.

Words in this deck are scheduled based on usage frequency in the games and contain sample usages drawn from the script (please don't sue me).

The idea is to provide study material that is immediately useful for something you (well, I ...) actually care about, rather than typical useless vocabulary about business and travel. The sample sentences provide an opportunity to recall in context, which I find accelerates learning.

I'm interested in feedback; let me know if you try this. I plan to produce more decks in the future, but may not clean them up for release unless it seems people are finding them useful.

tox2deck.zip (Tales of Xillia 2 frequency deck)

Technical details...
This deck was produced using data from both Tales of Xillia 1 & 2. I wrote tools to extract game data and convert the scripts. Then I used MeCab to analyze the sentence structure and extract root words. With that, I counted the frequency of each word to prioritize them. Definitions were extracted from the excellent EDICT.
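The counting step can be sketched roughly as follows. This is a minimal illustration, not the actual tool: it assumes the script lines have already been reduced to lists of root words (the part MeCab handles), and the function name `count_root_words` and the sample data are made up for the example.

```python
from collections import Counter


def count_root_words(tokenized_lines):
    """Count how often each root word appears across all script lines."""
    counts = Counter()
    for roots in tokenized_lines:
        counts.update(roots)
    return counts


# Hypothetical, hand-tokenized lines standing in for analyzer output.
script = [
    ["私", "は", "行く"],
    ["彼", "は", "行く"],
    ["私", "は", "食べる"],
]

frequency = count_root_words(script)

# Most frequent words first; this is the order cards would be introduced in.
study_order = [word for word, _ in frequency.most_common()]
print(study_order)
```

A real pipeline would probably also filter out particles and other words not worth a card before building the study order.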

Selecting sample sentences was a little complicated. I wanted to pick sentences long enough to mean something, but not full of unknown words. The priority policy I settled on was to sum the frequency of all words in a sentence and divide by the word count, i.e., bias towards sentences made of common words. This works reasonably well, but I'm not extremely happy with the results. There's room to improve in this area.
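The scoring policy described above amounts to an average word frequency per sentence. Here is a small sketch under stated assumptions: the names `sentence_score` and `pick_sample`, the `min_words` cutoff (a stand-in for "long enough to mean something"), and all the numbers are hypothetical.

```python
def sentence_score(words, frequency):
    """Sum of the words' corpus frequencies divided by the word count."""
    if not words:
        return 0.0
    return sum(frequency.get(w, 0) for w in words) / len(words)


def pick_sample(candidates, frequency, min_words=3):
    """Pick the highest-scoring candidate that has at least min_words words."""
    long_enough = [s for s in candidates if len(s) >= min_words]
    if not long_enough:
        return None
    return max(long_enough, key=lambda s: sentence_score(s, frequency))


# Hypothetical frequency table and pre-tokenized candidate sentences.
frequency = {"は": 30, "へ": 20, "私": 12, "行く": 9, "学校": 4, "精霊": 1, "術": 1}
candidates = [
    ["私", "は", "学校", "へ", "行く"],  # (12 + 30 + 4 + 20 + 9) / 5 = 15.0
    ["精霊", "術", "は"],                # (1 + 1 + 30) / 3, dragged down by rare words
    ["は", "へ"],                        # high average, but too short to be useful
]

best = pick_sample(candidates, frequency)
print(best, sentence_score(best, frequency))
```

One weakness of a plain average is that a few very common particles can hide a rare word, which may be part of why the results are not always satisfying.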

While mostly automatic, the results were not completely perfect. I manually deleted some junk which got included due to various parsing quirks. I have also been cleaning up problems with individual cards as I see them. The most common complaint I have is with made-up game words which were parsed into individual kanji. Overall, I'm pretty happy with it though. I plan to publish revisions with further cleanup in the future.

Updated 2013/01/19:
See the EDICT documentation for details about markers in the definitions. For example, the (P) marker indicates more common usages.
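For illustration, here is a rough sketch of how the (P) marker could be picked out of a definition line. It assumes the simple slash-delimited layout, with the headword and reading first and (P) as its own trailing field. The sample line is a simplified, made-up entry rather than one copied from the dictionary, and `parse_edict_line` is a hypothetical helper; check the EDICT documentation for the exact format.

```python
def parse_edict_line(line):
    """Split one EDICT-style line into (headword part, glosses, is_common)."""
    head, _, rest = line.partition("/")
    fields = [f for f in rest.strip().split("/") if f]
    is_common = "(P)" in fields  # assumption: (P) appears as a standalone field
    glosses = [f for f in fields if f != "(P)"]
    return head.strip(), glosses, is_common


# Simplified, illustrative entry (not taken verbatim from EDICT).
sample = "言葉 [ことば] /(n) language/word/(P)/"

head, glosses, is_common = parse_edict_line(sample)
print(head, glosses, is_common)
```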

Updated 2013/02/12:
Added a deck for Kami-sama to Unmei Kakumei no Paradox (神様と運命革命のパラドクス).

I haven't spent much time with it, but it should be similar to the TOX2 deck. The frequency computations are likely more oriented towards conversational vocabulary though. The TOX input data had a lot of repetitious quest and skill descriptions which created a bias towards somewhat obscure words...

I was vaguely hoping that both of these decks could share cards and scheduling information for seen cards, but Anki just creates unique cards for each deck. It's possible to merge the decks so that the sample sentences from both games are available in one deck, but I don't think it's very useful.

So I'm not sure what the best approach to overlapping decks is. Suggestions are welcome.

kamiparadeck.zip (Kami-sama to Unmei Kakumei no Paradox frequency deck)

Updated 2013/02/13:
Added a deck for Trails In The Sky Second Chapter (‰p—Y“`à ‹ó‚Ì‹OÕSC).

I got fed up decompiling the scripts, so I resorted to some lame text scanning. That's messy, and a little bit of junk slipped through, but it turned out better than I expected. It probably deserved some more quality checking, but I'm tired of looking at this and might never have gotten motivated to post it later.

tits2deck.zip (Trails In The Sky Second Chapter frequency deck)


All source © 2006-2014, asmodean. Don't copy, learn.