[Request] Bridging the gap between Bunpro and Yomitan

rururun · October 2, 2024, 1:18pm

If you’re familiar with Yomitan, it’s a pop-up dictionary tool with support for multiple kinds of dictionaries: bilingual, monolingual, frequency, pitch, etc. It would be really cool if Bunpro had support in Yomitan. Imagine looking up an unknown word or grammar pattern in Yomitan and being able to click a button to open a link to it on Bunpro in a new tab. It would be really convenient, because if I’m reading an ebook or a news article and I find an unknown word or grammar pattern that’s in Bunpro’s database, I could learn it on the fly and add it to my reviews.

Yomitan is open source and anyone can make a Yomitan dictionary, and they’re easy to make too because they’re just json datasets that get rendered out into HTML.

I want a Bunpro “dictionary” for Yomitan that has links for all of the words and terms to their relevant pages on Bunpro’s website. I would make it myself but Bunpro doesn’t have a publicly accessible API and I don’t want to scrape it. It probably wouldn’t be too hard for someone at Bunpro to write a python script that goes through the database and adds every term and link to a json database and make a Yomitan dictionary from it.

What do you guys think?

mike108 · October 2, 2024, 1:43pm

What’s wrong with doing that with Anki?
Or do you just prefer studying vocab on Bunpro instead?

What if the vocabulary word found in the wild doesn’t exist in Bunpro?

Sorry for the questions, I’m just curious as to what this would achieve

BigFreakinChungus · October 2, 2024, 7:01pm

I use yomitan a bit. When you mouse over text with it enabled, it shows you entries from all of the dictionaries you have added. I think what they’re asking is along the lines of, does a dictionary for yomitan exist that provides links to bunpro for vocab/grammar that bunpro has.

It’s actually a fairly good idea, I think. Maybe they just worded it strangely. It would be cool to mouse over some text I’m reading, and right there be able to see that there is a bunpro page on that grammar point. Then I could add that to a cram really quick and then at the end of my reading session do the cram.

It of course doesn’t do anything for grammar/vocab that bunpro doesn’t already have. You would just see your other added dictionaries in those cases.

veritas_nz · October 3, 2024, 6:46am

Definitely one of the plugins/projects that would go great once we eventually re-add an API.

We have JMDict IDs attached to each of our Vocab, so if Yomitan also allows you to reference those, I would think it would be a pretty easy integration to set up?

Marifly · October 3, 2024, 8:40am

The plugin 10ten shows info about WaniKani and Bunpro levels in its popup. If it works in 10ten it should be possible to do something similar in Yomitan?

rgggt · October 3, 2024, 11:34am

The problem with Anki is this. Say you have an Anki deck for Yotsubato and one for Flying Witch. Both decks will share a lot of the same words. If those decks were available in Bunpro, it would tell me how many words I already know, kind of like mangakotoba plus grammar points, and I wouldn’t have to learn any words again. Yeah, you can hit “I know this” five times over in Anki, but it would be better if known words never came up and the learn button only showed new words.

rururun · October 3, 2024, 12:21pm

API would be good for checking if it’s a word you’ve already learned or to directly add it to your reviews without navigating away from the page, but that would require someone to contribute a PR to Yomitan with that integration.

What I’m thinking of is more like an importable dictionary that just has a link to the relevant Bunpro web page. I tried making one earlier today by scraping the vocab and grammar decks for lists of terms, but in the end I couldn’t get it working because it was injecting hiragana in parentheses and I couldn’t get a proper pattern working for all of the URLs. Needless to say, I gave up after a while.

BigFreakinChungus · October 3, 2024, 2:57pm

This might be something you’ve already tried, but have you been stripping the parentheses around the hiragana characters in whatever script you’re using? It should be simple to do a linear search over each character of a string and remove the parentheses.

homa · October 4, 2024, 12:39pm

I like this idea, especially for grammar points!
But it will definitely be tricky to set up correct recognition

rururun · October 4, 2024, 12:43pm

Recognition is handled by Yomitan (I think it uses MeCab for parsing). Recognition for vocab would be no problem, but I’m unsure about some of the grammar points. There are other grammar dictionaries for Yomitan though, so I think I just need to reference them.

rururun · October 4, 2024, 12:47pm

I’m not sure how to do that in python but I tried begging ChatGPT to do it for me and of course ChatGPT is garbage so whatever it tried didn’t work.

I have a JSON file with a list of all words and grammar points from all of the JLPT decks, along with the relevant level and a rough URL scheme (it seems to work for all vocab if I remove the parentheses, but not for all grammar). I’ll have to figure out how to remove the characters in parentheses and how Bunpro handles URLs for grammar items. I found that in some cases 〜 needs to be replaced with -, but in other cases doing so still didn’t result in a working URL, so it needs some more tinkering.

veritas_nz · October 5, 2024, 1:37am

Can you elaborate?
This might be something we could provide for users

BigFreakinChungus · October 5, 2024, 2:53pm

As for removing certain characters from Python strings, you should be able to use the replace() method. Just set the replacement as an empty string.

Looks like this one removes all instances of the specified character. If the substring you pass to replace() is not found (in our case, ‘(’ or ‘)’), it just returns the original string. So it should be safe to call even with strings that don’t need to be edited.
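
Here is a minimal sketch of that replace() idea. The helper name strip_parens and the sample strings are made up for illustration, and handling the full-width （） characters is an assumption about what the scraped data might contain. Note that this only drops the parenthesis characters themselves; anything between them stays in the string.

```python
def strip_parens(text: str) -> str:
    """Remove parenthesis characters only; the text between them is kept."""
    # Assumption: the data may use ASCII () or full-width （） parentheses.
    for ch in "()（）":
        # str.replace returns the string unchanged when ch is absent,
        # so calling it on already-clean strings is harmless.
        text = text.replace(ch, "")
    return text


print(strip_parens("食べる(たべる)"))  # 食べるたべる
print(strip_parens("ねこ"))  # ねこ
```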

As for handling URLs, I would suggest opening up the grammar points you are having issues with and comparing what the browser shows to what you are getting.
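
One way to do that comparison in a script, as a rough sketch: URLs copied from the browser's address bar are often percent-encoded, so decoding both sides first makes the difference easier to spot. The example.com URLs and the path layout below are placeholders, not Bunpro's real URL scheme, and the helper names are hypothetical.

```python
from urllib.parse import unquote, urlsplit


def decoded_path(url: str) -> str:
    """Return the path part of a URL with percent-escapes decoded."""
    return unquote(urlsplit(url).path)


def same_page(generated_url: str, browser_url: str) -> bool:
    """Compare two URLs by their decoded paths."""
    return decoded_path(generated_url) == decoded_path(browser_url)


# Placeholder URLs (assumption): what a script generated vs. what was
# copied from the address bar for the same item.
generated = "https://example.com/grammar_points/〜ている"
from_browser = "https://example.com/grammar_points/%E3%81%A6%E3%81%84%E3%82%8B"

print(decoded_path(generated))
print(decoded_path(from_browser))
print(same_page(generated, from_browser))
```

Printing the two decoded paths next to each other should show which characters the site drops or swaps.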

rururun · October 5, 2024, 3:14pm

So Yomichan dictionaries are just a zip file with a collection of JSON files, and those files just contain a collection of arrays and objects for each entry. In the case of a traditional dictionary, this would include data such as the word itself, the reading of the word, and the definition of the word. However, you can put pretty much anything in there and it will work. For example, there is a Pixiv dictionary which contains data scraped from Pixiv’s net slang dictionary, and each entry in that dictionary includes a link to the original page on Pixiv.
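
To make that concrete, here is a rough sketch of building such a zip in Python. The field order in each entry and the "structured-content" link object follow my understanding of the format 3 term bank layout, so treat them as assumptions and check them against the schemas published in the Yomitan repository. The function names, the dictionary title, and the example.com URL are placeholders.

```python
import io
import json
import zipfile


def make_term_entry(term, reading, url, label="Open on Bunpro"):
    """Build one term bank row whose only 'definition' is a link."""
    link = {"tag": "a", "href": url, "content": label}
    glossary = [{"type": "structured-content", "content": link}]
    # Assumed field order: term, reading, definition tags, rule identifiers,
    # score, glossary list, sequence number, term tags.
    return [term, reading, "", "", 0, glossary, 0, ""]


def build_dictionary(title, entries):
    """Return the bytes of a zip holding index.json and one term bank."""
    index = {"title": title, "format": 3, "revision": "1"}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("index.json", json.dumps(index, ensure_ascii=False))
        zf.writestr("term_bank_1.json", json.dumps(entries, ensure_ascii=False))
    return buffer.getvalue()


# Placeholder data (assumption): a real build would read terms and URLs
# from a list of Bunpro items.
entries = [make_term_entry("食べる", "たべる", "https://example.com/vocabs/食べる")]
zip_bytes = build_dictionary("Bunpro Links (demo)", entries)
print(len(zip_bytes), "bytes")
```

Writing zip_bytes to a .zip file should give something that can be tried with Yomitan's dictionary import.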

My intention for a Bunpro deck is that it would just have a link to the relevant Bunpro page for each item. I like to read visual novels, and my idea was that if I find something I don’t know and it’s in Bunpro, I could follow the link to quickly read up on it or maybe add it to my reviews.

rururun · October 5, 2024, 3:14pm

A little more complex than that. We would need to remove all of the characters contained between those parentheses as well.

BigFreakinChungus · October 5, 2024, 3:32pm

This Stack Overflow question talks about using regular expressions with re.sub(). This could be useful.

Or, you could use split() to break the string at each set of parentheses, discard what is inside them, and then stitch the strings back together. (This is what I would do.)
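
Both options could look roughly like this. It is only a sketch: the function names and sample strings are made up, the parentheses are assumed to be balanced and not nested, and matching the full-width （） characters in the regex version is a guess about the data.

```python
import re

# An opening parenthesis, everything up to the nearest closing one, and
# that closing parenthesis. ASCII and full-width forms are both matched.
PAREN_GROUP = re.compile(r"[(（][^)）]*[)）]")


def drop_parenthesized(text: str) -> str:
    """Regex version: delete each parenthesized group, contents included."""
    return PAREN_GROUP.sub("", text).strip()


def drop_parenthesized_split(text: str) -> str:
    """split() version, handling ASCII parentheses only."""
    first, *rest = text.split("(")
    kept = [first]
    for chunk in rest:
        _inside, closer, after = chunk.partition(")")
        # With no closing parenthesis, put the chunk back untouched.
        kept.append(after if closer else "(" + chunk)
    return "".join(kept).strip()


print(drop_parenthesized("食べる(たべる)"))  # 食べる
print(drop_parenthesized_split("食べる(たべる)"))  # 食べる
```

The regex version is shorter, while the split() version makes each step explicit and is easier to debug by printing the chunks.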

Also, in my experience, ChatGPT is really good at simple string manipulation functions. Maybe you just didn’t ask for it descriptively enough?

rururun · October 6, 2024, 1:37am

Thanks I’ll give it a look after the weekend is over.

homa · October 28, 2024, 8:52pm

So are you actually making it?

SmolSwol · October 28, 2024, 10:01pm

I also wanted a bunpro api so I could programmatically find and export information from grammar points, and would also prefer not to scrape (and the pages aren’t consistent enough anyway, I think.)
I would love if we had a good API. I’m constantly writing code to lower the barrier between me and making cards so I can maximize words, but I really couldn’t find a good grammar dictionary that had an accessible api. I hope this post maybe puts a boost behind the api development

rururun · November 1, 2024, 1:11am

Nah, I’m busy making open source Digimon