Vocabulary Extractor: Make your own Decks from Manga/Anime/EBooks

Can I ask what exactly you mean by “absolute and relative frequencies”? Do you just mean the number of occurrences and the percentage those words make up? I wanna ask because if there are any more useful sorting orders, it’s definitely worth thinking about adding them

And @hisashib00ty : Thank you so much! I’m happy to help with any problems that come up at any time.

3 Likes

Yes, it’s exactly that.

2 Likes

Would you find it helpful for the program to output that data as another column in the CSV as well? This would have no use for Bunpro, but maybe it’d be helpful for checking your progress as you go: you could just look up the vocab item you’re currently at and see what percentage of the words in the episode you’ll understand?
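To illustrate what I mean, here’s a rough sketch of that kind of coverage check (hypothetical code, not from the tool itself; the `word`/`count` column names and the sample data are made up):

```python
import csv
import io

# Hypothetical sketch: given a frequency-sorted vocab CSV, compute what
# cumulative percentage of the episode's words you'd understand after
# learning each item. Column names are made up for illustration.
sample_csv = "word,count\nの,120\nは,90\n料理,15\n上手,5\n"

rows = list(csv.DictReader(io.StringIO(sample_csv)))
total = sum(int(r["count"]) for r in rows)

running = 0
for r in rows:
    running += int(r["count"])
    r["cumulative_pct"] = round(100 * running / total, 1)

# After learning the first two items you'd cover (120+90)/230 ≈ 91.3% of tokens
print([(r["word"], r["cumulative_pct"]) for r in rows])
# → [('の', 52.2), ('は', 91.3), ('料理', 97.8), ('上手', 100.0)]
```

Writing that `cumulative_pct` value out as an extra column would be a one-line change to whatever already writes the CSV.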

2 Likes

I think that’s useful. It is for me, at least; that’s why I developed my own.
And that’s why sites like https://learnnatively.com/ exist: people want to know how much they’ll understand of the media they want to consume. Except this would be self-service for any media the user wants, not limited to what’s available on any given platform.

IDK how to extract the vocab list from Bunpro or WK, but it should be doable. The main complexity is filtering out particles, and handling lemmas in the vocab list that may not match mecab’s output: conjugated words, kana words in kanji form, etc.
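To make the matching problem concrete, here’s a toy sketch with made-up data (the spelling-variant table is hypothetical; in practice you’d need a real dictionary like JMdict to build it):

```python
# Rough sketch of the lemma-matching problem described above, with made-up
# data. Assume mecab already gave us lemmas for the extracted text; the
# known-vocab list (e.g. exported from WK) may spell the same word differently.
known_vocab = {"御母さん", "分かる"}  # kanji forms, dictionary forms

# Hypothetical normalisation table mapping alternate spellings to the form
# used in the vocab list. Building this for real would need JMdict or similar.
spelling_variants = {"お母さん": "御母さん"}

def is_known(lemma: str) -> bool:
    return lemma in known_vocab or spelling_variants.get(lemma) in known_vocab

extracted_lemmas = ["お母さん", "を", "持つ", "分かる"]
unknown = [w for w in extracted_lemmas if not is_known(w)]
print(unknown)  # → ['を', '持つ']
```

Even this tiny example shows why it’s fiddly: without the variant table, お母さん would wrongly be reported as unknown.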

3 Likes

Scraping Bunpro for their vocab lists would be doable, but I wouldn’t want to, because they are custom lists and I’m not sure about licensing issues. I’m also not quite sure it would be worth the effort, tbh: leaving out certain words just because they’re not on the official lists kind of defeats the purpose of the whole idea of custom decks? If the example sentences are necessary, I think it’d be better to just look them up as you go and add them yourself.

4 Likes

Sounds pretty awesome! Great work!

2 Likes

Question for you @Flutter - I’m playing around with splitting a novel into chapters (with GitHub - JimmXinu/EpubSplit: EpubSplit Calibre Plugin) and wanted to ask - when I generate a csv for each chapter, does the tool already skip adding duplicates to each generated chapter csv? (So in chapter1.csv, for example, りんご would not show up twice.) I believe the documentation mentioned that the combined csv did not have duplicates, at least.

1 Like

No single file generated by my program will ever have duplicates. I could definitely make this a lot more clear, thanks for pointing that out!

2 Likes

Awesome, thanks!

1 Like

To be absolutely clear though: words found in chapter1.csv will also be present in chapter2.csv; duplicates across multiple files are only filtered during the combination step into vocab_combined.csv. I hope that helps!
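Conceptually the combination step works like this (a minimal sketch, not the tool’s actual code; the in-memory CSVs and the single `word` column are assumptions for illustration):

```python
import csv
import io

# Minimal sketch of the combine-and-deduplicate step described above.
# The per-chapter CSVs are simulated in-memory here.
chapter_csvs = {
    "chapter1.csv": "word\nりんご\n食べる\n",
    "chapter2.csv": "word\nりんご\n走る\n",  # りんご appears again here
}

seen = set()
combined = []
for name, text in chapter_csvs.items():
    for row in csv.DictReader(io.StringIO(text)):
        if row["word"] not in seen:  # keep only the first occurrence
            seen.add(row["word"])
            combined.append(row["word"])

print(combined)  # → ['りんご', '食べる', '走る']
```

So each chapter file stays self-contained, and the dedup only happens once everything is merged.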

2 Likes

No, that makes sense. I almost asked about it, then realized how much more work it would be for the program to keep track of multiple separate word lists to compare all their vocab.

2 Likes

Great work, thanks!!

2 Likes

How accurate is this for manga? In the past I’ve had very bad experiences with the inaccuracies of OCR with manga. It was nearly useless due to the large number of misparses. There are so many irregularities in manga text/font/style/positioning.

3 Likes

This is just as accurate as Mokuro if you have any experience with it (because it literally just uses the mokuro output). In my experience it’s pretty good even with challenging source material. Of course there are mistakes, but I definitely wouldn’t say it’s “nearly useless”; you’re welcome to try it out on your own source material. And of course, the higher the resolution, the better the result will be

3 Likes

Thanks, I just gave it a go! I installed Python 3.12 as recommended. Something went wrong running the tool; I got a couple of errors (below). Does this look like something on my end? To keep it simple, I dumped all of my manga JPG files into one folder, ran it, and got an empty CSV file as output:

Code
C:\Users\zan>jpvocab-extractor --type manga C:\Users\~~\ALL_FILES
2025-01-07 02:17:44,722 | INFO | root - Extracting texts from C:\Users\~~\ALL_FILES...
2025-01-07 02:17:44,722 | INFO | root - Running mokuro with command: ['mokuro', '--disable_confirmation=true', 'C:/~~/ALL_FILES']
2025-01-07 02:17:44,722 | INFO | root - This may take a while...
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\zan\AppData\Local\Programs\Python\Python312\Scripts\mokuro.exe\__main__.py", line 4, in <module>
  File "C:\Users\zan\AppData\Local\Programs\Python\Python312\Lib\site-packages\mokuro\__init__.py", line 3, in <module>
    from mokuro.manga_page_ocr import MangaPageOcr as MangaPageOcr
  File "C:\Users\zan\AppData\Local\Programs\Python\Python312\Lib\site-packages\mokuro\manga_page_ocr.py", line 7, in <module>
    from comic_text_detector.inference import TextDetector
  File "C:\Users\zan\AppData\Local\Programs\Python\Python312\Lib\site-packages\comic_text_detector\inference.py", line 14, in <module>
    from comic_text_detector.utils.io_utils import imread, imwrite, find_all_imgs, NumpyEncoder
  File "C:\Users\zan\AppData\Local\Programs\Python\Python312\Lib\site-packages\comic_text_detector\utils\io_utils.py", line 11, in <module>
    NP_BOOL_TYPES = (np.bool_, np.bool8)
                               ^^^^^^^^
  File "C:\Users\zan\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\__init__.py", line 427, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'bool8'. Did you mean: 'bool'?
2025-01-07 02:17:55,287 | ERROR | root - Mokuro failed to run.
2025-01-07 02:17:55,287 | INFO | root - Getting vocabulary items from all...
2025-01-07 02:17:55,287 | INFO | root - Vocabulary from all: , ...
2025-01-07 02:17:55,287 | INFO | root - Processing CSV(s) using dictionary (this might take a few minutes, do not worry if it looks stuck)...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 79.32it/s]
2025-01-07 02:17:55,318 | INFO | root - Vocabulary saved into: vocab_all in folder C:/~~/ALL_FILES
2 Likes

Could you run “pip list” and check the version of numpy you have installed? Thank you!

EDIT: If you notice it’s v2.x or above, try this command:

pip install numpy==1.26.4
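For context (my own note, not from the tool’s docs): NumPy 2.0 removed the long-deprecated `np.bool8` alias, which is exactly what makes `comic_text_detector` crash in the traceback above. A quick way to tell whether an installed version is affected, parsing the version string by hand:

```python
# NumPy 2.0 removed the deprecated np.bool8 alias, which causes the
# AttributeError in the traceback above. Any 2.x version is affected.
def needs_downgrade(version: str) -> bool:
    return int(version.split(".")[0]) >= 2

print(needs_downgrade("2.0.1"))   # → True  (pin numpy==1.26.4)
print(needs_downgrade("1.26.4"))  # → False (already fine)
```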

3 Likes

That fixed the error, thank you!! It’s processing - not done yet - but since it looks like it’s going to take quite a while to finish the whole thing, I figured I’d give an update now!

2 Likes

Thanks a lot! I’ll add a note on the page with the fix. The error appears to be caused by an update to one of the packages I’m using, so sadly there’s nothing I can currently do to fix it directly.

And yeah, processing the manga with mokuro can take a long time. If you have a compatible GPU, you might be able to install PyTorch so mokuro can make use of it: Start Locally | PyTorch. I can’t really provide much help with this though, sadly; as far as I know, mokuro should just automatically use the GPU once PyTorch is installed and working correctly.
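If anyone wants to check whether PyTorch can actually see their GPU before kicking off a long run, something like this works (a generic check, nothing specific to mokuro):

```python
# Quick check of whether PyTorch can see a CUDA GPU. If this prints False,
# mokuro will fall back to (much slower) CPU processing.
try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed; mokuro will run on CPU")
```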

3 Likes

Interesting, I’ll look into it after this batch. Thanks!

All good - I’d rather the tool take a long time if it helps with accuracy.

2 Likes

I was messing around comparing JPDB and this tool.

Sentence: こんな料理上手なお母さんを持って幸せなんだから分かってるのか?
JPDB: こんな, 料理上手, お母さん, を, 持つ, 幸せ, なんだ, から, 分かる, のか
japanese-vocabulary-extractor: こんな, 料理, 上手, だ, 御, 母, さん, を, 持つ, て, 幸せ, の, から, 分かる, てる, か

Which do you guys think is better?
JPDB seems to focus on forming longer words, while this tool breaks them into smaller components.
I can’t tell which is better; both seem OK in their own way.
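If you preferred the longer-word behaviour, you could in principle post-process the finer-grained output by merging adjacent noun tokens. A toy sketch (entirely hypothetical; real tokenizer output would carry proper part-of-speech tags, and real merging rules are more subtle than this):

```python
# Toy sketch: merging adjacent noun tokens (roughly what JPDB appears to do)
# versus keeping short units (what the extractor does). The POS tags here
# are simplified stand-ins for what mecab would actually emit.
tokens = [("料理", "noun"), ("上手", "noun"), ("な", "aux")]

def merge_compound_nouns(tagged):
    merged, buf = [], ""
    for surface, pos in tagged:
        if pos == "noun":
            buf += surface          # keep accumulating a compound noun
        else:
            if buf:
                merged.append(buf)  # flush the compound before the non-noun
                buf = ""
            merged.append(surface)
    if buf:
        merged.append(buf)
    return merged

print(merge_compound_nouns(tokens))  # → ['料理上手', 'な']
```

Which granularity is “better” probably depends on the learner: short units match dictionary headwords more reliably, while merged compounds match how you’d actually encounter the word.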

2 Likes