Vocabulary Extractor: Make your own Decks from Manga/Anime/EBooks

As the import feature for custom decks finally came out and the conversations about this program are starting to fill up the Community Decks thread a bit too much, I thought I’d officially announce this and allow for any discussion here.

If you’d like to make your own decks from manga, anime, ebooks (pdf/epub) or other types of media, feel free to use this tool I made that does all that in just a single command: GitHub - Fluttrr/japanese-vocabulary-extractor: A script to extract a vocabulary list from various forms of Japanese media

This does require the use of the command line, but it should be pretty simple with the tutorial on the site, and if anybody needs it I’d be happy to make a video tutorial too. Other than that, you just need the manga (as images), the anime subtitles in ASS or SRT format, or the ebook as a PDF or EPUB.

See the “Bunpro” section on the main page for a command that will work best with the Bunpro import feature. The resulting output can simply be copy-pasted into the import window, although the current import limits might require a bit of manual copy-pasting to make everything fit. Hopefully this helps some people!

Also, if you upload any decks made with this program, please leave a link to it so other people can make their own decks with it as well.

23 Likes

I played around with your extractor just yesterday actually. Your guide was really easy to follow お疲れ様🎉

But I think I saw you post that it doesn’t output the vocab chronologically? It’s just random output, right? Do you have any plans for a --chronological flag or something of the sort as an option? Or even a sort-by-frequency option?

I would prefer a chronological output option, so on Bunpro I could just hit the “learn” button on the next items I’ll be working on from the book. Or even learn ahead to breeze through the next part of the book.

5 Likes

The only thing I can do easily is preserve the order of appearance. I’ll actually try to implement this in a few minutes and hope it won’t make performance suffer too much. I agree that would definitely be much better than a random order within each unit.
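For anyone curious how preserving first appearance can work: once the text is tokenized, it comes down to de-duplicating while keeping insertion order. Here’s a minimal sketch in plain Python (this is an illustration under the assumption of pre-tokenized input, not the tool’s actual code):

```python
def dedupe_preserve_order(tokens):
    """Keep only the first occurrence of each word, in order of appearance.

    dicts preserve insertion order in Python 3.7+, so this runs in O(n)
    without any extra sorting step.
    """
    return list(dict.fromkeys(tokens))

tokens = ["猫", "は", "猫", "が", "好き", "は"]
print(dedupe_preserve_order(tokens))  # → ['猫', 'は', 'が', '好き']
```

Because no sort is involved, this shouldn’t noticeably hurt performance even on large texts.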

6 Likes

Just published the updated version (v1.3.0), from a short test it should preserve the order of appearance now, thanks so much for the suggestion!

7 Likes

Great tool and idea!

FYI we’re still tinkering with the Import tool and had to temporarily disable it due to some performance issues.

I don’t think the format of the text you import will change from now though, if that’s any consolation.

5 Likes

This is cool! Haven’t tried it yet, but read the GitHub page. Would there be a way to extract the frequency of each word too? That would be a good order in addition to appearance order.

2 Likes

If I find time today I’ll work on implementing that. If I understand correctly, you’re talking about sorting the vocab so that the words which appear most often in each unit come first, right?

1 Like

I just found some time for it actually, it’s updated now in v1.4.0! The --freq-order option will sort the vocabulary by frequency within the source material.

Here’s an example of what that output tends to look like: [screenshot of example output omitted]
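For the curious, a frequency sort like this can be sketched with the standard library’s `collections.Counter` (a minimal illustration assuming pre-tokenized input, not the tool’s actual implementation):

```python
from collections import Counter

def freq_sorted_vocab(tokens):
    """Return unique words sorted by how often they appear, most frequent
    first. most_common() uses a stable sort, so ties stay in order of
    first appearance."""
    counts = Counter(tokens)
    return [word for word, _ in counts.most_common()]

tokens = ["は", "猫", "は", "猫", "は", "好き"]
print(freq_sorted_vocab(tokens))  # → ['は', '猫', '好き']
```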

6 Likes

Pro :sob::tada:

Do we have to do anything to use the updates, like reinstalling?

2 Likes

Just run the following command:

pip install --upgrade japanese-vocabulary-extractor

5 Likes

Thanks for sharing! I was wondering if it’s possible to filter the vocabulary against the N lists? I would prefer to stick to vocab with example sentences.

2 Likes

Do you mean limiting the output to vocab that also appears in the Bunpro decks for N1–N5? I think those are custom lists, so I don’t think I could easily implement this, sorry.

4 Likes

Hey, this is amazing, thank you!

2 Likes

Thank you so much, I’m super happy people are finding it helpful!

5 Likes

Your tool is exactly what I’ve been looking for, but I can’t get it to work. I’m using Python 3.12, and no matter which file I try, both subtitles and plain txt, I keep getting back the error:

'charmap' codec can't decode byte 0x81 in position 50: character maps to <undefined>

I checked and the file uses UTF-8 encoding, but I’m not tech-savvy enough beyond typing canned commands to know how to fix this :sweat: not sure what I’m doing wrong

2 Likes

Could you send some more detailed logs and possibly a file that causes this behavior? You could send the file to [email protected] if you’d like. From what I can tell this might be fixable pretty easily, or it might be something about the file itself, but I can’t really test without being able to reproduce it, sadly.

2 Likes

I took a wild guess and uploaded a potential fix. Try updating to version 1.4.1 and let me know if the issue is still there; I’ll be happy to fix this somehow. The fix assumes you’re on Windows and that your command line is forcing a codec onto the file that isn’t UTF-8, so I adjusted my code to explicitly read the file as UTF-8. Let me know if you’re not actually on Windows. Any information will help.
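For anyone hitting the same error: on Windows, `open()` without an explicit encoding falls back to the locale codec (often cp1252, which Python reports as the "charmap" codec), and that chokes on Japanese bytes like 0x81. The fix boils down to always passing the encoding explicitly; a minimal sketch (the function name is illustrative, not the tool’s actual code):

```python
from pathlib import Path

def read_subtitle_text(path):
    """Read a subtitle/text file as UTF-8 regardless of the OS default.

    Without encoding="utf-8", Python uses locale.getpreferredencoding(),
    which on many Windows systems is cp1252 — the source of the
    "'charmap' codec can't decode byte" error on Japanese text.
    """
    return Path(path).read_text(encoding="utf-8")
```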

4 Likes

thank you, I’ll update and try again when I’m done with my reviews and if it still doesn’t work I’ll send you the files :pray:

1 Like

whatever you did worked!! thank you so much, this is an amazing tool

4 Likes

This is quite cool, thank you! I created some scripts myself a few years ago to extract vocab from epubs and subtitles (my own Anki vocab deck is based on an analysis I made of the multiple GB of books and subs available on DJT), but I never thought of sharing them (my code works, but it sucks).

And thank you for adding the freq order. What I especially like to do is use the freq-order list (with the respective absolute and relative frequencies in the corpus I’m analysing) to check “how much from this corpus do I theoretically know, and what words do I need to learn to reach X%?”
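That coverage check can be sketched in a few lines (a hypothetical helper, assuming a tokenized corpus and a set of already-known words; not from either of our tools):

```python
from collections import Counter

def words_needed_for_coverage(tokens, known, target=0.8):
    """Report token coverage of `known` words over `tokens`, and list the
    unknown words (most frequent first) needed to reach `target` coverage.

    Returns (final_coverage, words_to_learn)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = sum(n for w, n in counts.items() if w in known)
    to_learn = []
    for word, n in counts.most_common():
        if covered / total >= target:
            break
        if word not in known:
            to_learn.append(word)  # greedily pick the most frequent unknowns
            covered += n
    return covered / total, to_learn

tokens = ["は", "は", "猫", "猫", "好き"]
coverage, to_learn = words_needed_for_coverage(tokens, known={"は"}, target=0.8)
print(coverage, to_learn)  # → 0.8 ['猫']
```

Picking the most frequent unknown words first is the greedy strategy that reaches any given coverage target with the fewest new words.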

5 Likes