Vocabulary Extractor: Make your own Decks from Manga/Anime/EBooks

Can I ask what exactly you mean by “absolute and relative frequencies”? Do you just mean the number of occurrences of each word and the percentage of the text those words make up? I’m asking because if there are any more useful sorting orders, it’s definitely worth thinking about adding them.

And @hisashib00ty : Thank you so much! I’m happy to help with any problems that come up at any time.

Yes, it’s exactly that.
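
For anyone following along, here is a minimal sketch of what those two numbers look like in practice. It assumes the text has already been split into a list of words; the `frequencies` function and the sample list are made up for illustration and are not part of the extractor.

```python
from collections import Counter

def frequencies(words):
    """Return (word, absolute count, relative frequency) tuples,
    sorted from most to least frequent."""
    counts = Counter(words)
    total = sum(counts.values())
    # Relative frequency = occurrences of this word / all word occurrences.
    return [(word, n, n / total) for word, n in counts.most_common()]

# Hypothetical, already-tokenized sample text.
sample = ["りんご", "食べる", "りんご", "猫", "りんご", "食べる"]
table = frequencies(sample)

for word, absolute, relative in table:
    print(f"{word}\t{absolute}\t{relative:.1%}")
```

Sorting by either column gives the same order within a single text; the relative value mostly matters when comparing texts of different lengths.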

Would you find it helpful for the program to output that data as another column in the csv as well? This would have no use for Bunpro, but maybe it’d be helpful for checking your progress as you go: you could look up the vocab item you’re currently at and see what percentage of the words in the episode you’d already understand.
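
Roughly what I have in mind, as a sketch only: the column names (`word`, `count`, `cumulative_percent`) and both functions are hypothetical, and it assumes that learning words in frequency order means you know every word above your current row.

```python
import csv
import io
from collections import Counter

def coverage_rows(words):
    """Rows of (word, count, cumulative coverage in percent), most frequent first."""
    counts = Counter(words)
    total = sum(counts.values())
    rows = []
    running = 0
    for word, n in counts.most_common():
        running += n  # occurrences covered once this word is learned
        rows.append((word, n, round(100 * running / total, 1)))
    return rows

def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "count", "cumulative_percent"])
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical, already-tokenized sample text.
sample = ["りんご", "食べる", "りんご", "猫", "りんご", "食べる"]
rows = coverage_rows(sample)
print(to_csv(rows))
```

With a column like that, the value in the row you are currently studying is the share of word occurrences in the episode you would recognize.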

I think that’s useful. It is for me, at least, which is why I developed my own.
It’s also why sites like https://learnnatively.com/ exist: people want to know how much they’ll understand of the media they want to consume. Except this would be self-service for any media the user wants, not limited to what’s available on a given platform.

IDK how to extract the vocab list from bunpro or WK, but it should be doable. The main complexity is filtering for particles and lemmas in the vocab list that may not match the output from mecab. Like conjugated words, kana words in kanji form, etc.
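
A rough sketch of the matching step, assuming the tokenizer output has already been reduced to `(surface, lemma, part of speech)` tuples. That tuple layout, the `SKIP_POS` set and the `known_ratio` function are simplifications I made up for illustration, not mecab’s real output format.

```python
# Part-of-speech tags to ignore: particles, auxiliary verbs, symbols.
SKIP_POS = {"助詞", "助動詞", "記号"}

def known_ratio(tokens, known_lemmas):
    """Share of content-word occurrences whose lemma is in the known list."""
    content = [lemma for _surface, lemma, pos in tokens if pos not in SKIP_POS]
    if not content:
        return 0.0
    known = sum(1 for lemma in content if lemma in known_lemmas)
    return known / len(content)

# Hypothetical tokens for 猫がりんごを食べた.
tokens = [
    ("猫", "猫", "名詞"),
    ("が", "が", "助詞"),
    ("りんご", "りんご", "名詞"),
    ("を", "を", "助詞"),
    ("食べ", "食べる", "動詞"),   # conjugated form mapped back to its lemma
    ("た", "た", "助動詞"),
]
ratio = known_ratio(tokens, {"猫", "食べる"})
print(f"{ratio:.0%} of content words known")
```

Comparing lemmas handles conjugated words, but it does nothing for the kana-versus-kanji mismatch (a list that has 林檎 will not match りんご), which would need a reading-based lookup on top.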

Scraping Bunpro for their vocab lists would be doable, but I wouldn’t want to because they are custom lists and I’m not sure about licensing issues. I’m also not quite sure this would be worth the effort tbh, I think leaving out certain words just because they’re not on the official lists kinda defeats the purpose of the whole idea of custom decks? I think if the example sentences are necessary it’d be better to just look them up as you go and add them yourself.

Sounds pretty awesome! Great work!

Question for you @Flutter: I’m playing around with splitting a novel into chapters (with the EpubSplit Calibre plugin, JimmXinu/EpubSplit on GitHub). When I generate a csv for each chapter, does the tool already skip adding duplicates to each generated chapter csv? (So in chapter1.csv, for example, りんご would not show up twice.) I believe the documentation mentioned that the combined csv did not have duplicates, at least.

No single file generated by my program will ever have duplicates. I could definitely make this a lot clearer in the documentation, thanks for pointing that out!

Awesome, thanks!

To be absolutely clear though: words found in chapter1.csv can still show up again in chapter2.csv. Duplicates across multiple files are only filtered out during the combination step that produces vocab_combined.csv. I hope that helps!
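
To illustrate the difference, here is a simplified sketch of the two levels of deduplication. It is not the program’s actual code; `dedupe`, `combine` and the chapter lists are stand-ins made up for this example.

```python
def dedupe(words):
    """Remove repeats within one list, keeping first-seen order."""
    return list(dict.fromkeys(words))

def combine(chapters):
    """Merge several word lists, dropping words already seen in an earlier list."""
    seen = set()
    combined = []
    for words in chapters:
        for word in dedupe(words):
            if word not in seen:
                seen.add(word)
                combined.append(word)
    return combined

# Hypothetical word lists extracted from two chapters.
chapter1 = ["りんご", "猫", "りんご"]
chapter2 = ["猫", "犬"]

per_file = [dedupe(chapter1), dedupe(chapter2)]  # what each chapter csv would hold
combined = combine([chapter1, chapter2])         # what the combined csv would hold

print(per_file)
print(combined)
```

Here 猫 appears in both per-chapter lists but only once in the combined one.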

No, that makes sense. I almost asked about it, then realized how much more work it would be for the program to keep track of multiple separate word lists to compare all their vocab.