I tried to work on this. I scraped the grammar points page and generated a working dictionary. It's only for grammar points, though; there's no vocabulary.
Running `python script.py` does the following:
- Download (then cache) the Grammar Points Page
- Parse the page and generate a clean JSON of the data (then cache)
- Append new grammar points to the `conjugation.csv` file (used to customize recognition reading and inflection for each point)
- Use the JSON and the `conjugation.csv` file to generate a dictionary according to the schemas
- Output a file like `Bunpro Grammar-2025-02-06.zip` in the `build/` directory.
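As a rough sketch of the caching and output steps above (not the actual code in the script): `cached_text` and `write_dictionary_zip` are hypothetical helper names, and the `index.json` / `term_bank_1.json` file names inside the archive are an assumption modeled on Yomitan-style dictionaries, since the post doesn't say which schemas are used.

```python
import json
import zipfile
from datetime import date
from pathlib import Path


def cached_text(path: Path, produce) -> str:
    """Return the cached file if present; otherwise call produce() and cache it."""
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = produce()  # e.g. the function that downloads the grammar points page
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text


def write_dictionary_zip(build_dir: Path, index: dict, term_bank: list, today: date) -> Path:
    """Write a dated archive such as 'Bunpro Grammar-2025-02-06.zip' into build_dir."""
    build_dir.mkdir(parents=True, exist_ok=True)
    out = build_dir / f"Bunpro Grammar-{today.isoformat()}.zip"
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        # Assumed archive layout; adjust to whatever the target schemas require.
        zf.writestr("index.json", json.dumps(index, ensure_ascii=False))
        zf.writestr("term_bank_1.json", json.dumps(term_bank, ensure_ascii=False))
    return out
```

Passing the download or parse step in as `produce` keeps the cache logic the same for both the raw page and the cleaned JSON.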
The dictionary itself is pretty bad in its current state. It only works for non-conjugatable structures, so it will find ใงใใใใใใงใใใใ好ใใใใใใใฎใใใใใใใใใใใใใใใ, and will not find ใใใใใพใใใใใพใใใชใใ真っ赤, ใใใใชใ etc. So it's pretty much useless.
To make it better, it needs manual cleanup (across all 927 points) to set up the terms correctly, since it will never find "ใใใ โก" or "ใ-Verb (Dictionary)" as written, plus adding the reading where necessary and the inflection tag to support conjugations. It would also be good to scrape every grammar point page, to display more useful information directly, get more versions of the same grammar point, and other stuff I can't think of right now. That's a ton of work and I probably won't be doing that anytime soon.
The `conjugation.csv` file is currently empty; running the script for the first time will populate it. Fixing all the terms and adding readings and inflection tags in the file would be enough to make the dictionary usable. Beyond that, it would be a good idea to add something to the file to "exclude" certain super common points, or perhaps to add other points as aliases with the same ID. If anyone wants to open a PR against the code or send an updated CSV file directly, I'll accept it.
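To make the CSV idea concrete, here is a minimal sketch of how new points could be appended without overwriting manual edits. The column names (`id`, `term`, `reading`, `inflection`, `exclude`) and the `append_new_points` helper are assumptions for illustration; the real file layout may differ, and the `exclude` column is only the suggestion from this post, not something the script currently supports.

```python
import csv
from pathlib import Path

# Hypothetical column layout for conjugation.csv.
FIELDS = ["id", "term", "reading", "inflection", "exclude"]


def append_new_points(csv_path: Path, points: list[dict]) -> int:
    """Append points whose id is not yet in the CSV; return how many were added."""
    seen = set()
    if csv_path.exists():
        with csv_path.open(newline="", encoding="utf-8") as f:
            seen = {row["id"] for row in csv.DictReader(f)}

    # A missing or empty file still needs its header row.
    needs_header = not csv_path.exists() or csv_path.stat().st_size == 0
    added = 0
    with csv_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        for point in points:
            point_id = str(point["id"])
            if point_id in seen:
                continue  # keep the existing, possibly hand-edited row
            seen.add(point_id)
            # Reading, inflection and exclude start blank for manual cleanup.
            writer.writerow({"id": point_id, "term": point["title"],
                             "reading": "", "inflection": "", "exclude": ""})
            added += 1
    return added
```

Because existing rows are never rewritten, anyone fixing terms or adding inflection tags by hand can rerun the script safely.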
Anyway, during the research I found these Grammar Dictionaries, and I'll probably be using those instead for the near future, although a link to Bunpro would be cooler.