Bunpro Vocab Decks (Announcement) - Mar/29/2024

Asher · March 31, 2024, 3:50am

Hi everyone!

Hope you’re all enjoying your week and getting excited to start some book club action together!

This is a quick post about the Bunpro vocab decks for novels/manga etc, and how the evolution of this system is taking place on our back end. It will also include a bit of information about what you can expect in the near future.

The Bunpro Text Wrapper

Some of you may remember that we started ‘wrapping’ grammar points within standard grammar sentences about 2 years ago. This was a new feature at the time which meant that you could hover over a specific part of any sentence that you didn’t fully understand, click on it, and if the part you clicked on was one of our grammar structures, a little popup would tell you what the grammar was.

Believe it or not, when we first started doing this, it was just @Fuga and myself sitting in Google Docs all day staring at walls of Japanese text and inserting HTML tags manually for each and every grammar point. Not the most effective system in the world, and certainly prone to a few… human errors.

Despite the rocky start, we knew that the capability that grammar wrapping provided was something that would make the life of any student far easier, and that we should explore the concept further. Since last year we have been slowly developing a tool that we will use behind the scenes that will do this work for us.

We have adapted part of this tool to provide vocab decks for any and all media going forward.

Beta Testing (kind of)

This tool is still in a v1 alpha phase and we are working to tweak the output when applied to this specific use case, but the best way for us to do so is to put it to work and adjust it as it parses more content.

To start with, we will only be providing decks that cover the first chapter of each book; the second chapter will be added to the decks the following week, and so on. This is so that we can do a lot of quality checking on relatively small amounts of text each time, and so that each fix we make lets every new batch of words contain fewer errors.

As many of these ‘errors’ are more so things that are far easier for people to spot than for our tool, we would like to actively ask those who are using the decks to give us feedback. Here are some examples of things that our tool cannot catch (just yet), but that are easy for people to catch.

Words that can be said in different ways, but use the same kanji. For example - 避ける with the reading of さける, and 避ける with the reading of よける. Unless there is furigana, the only way to know which one is being used is through context. If you think that one of the words in a deck has the wrong reading, please let us know!
In the reverse of the situation above, sometimes words in books are written as hiragana only, when there are several options for that word. An example here is おさめる. Does the book mean 修める, 収める, 治める, or 納める? Again, if you think our tool chose the wrong one, please let us know!
Set phrases being split (if we don’t have a record of the phrase already). There are many times that words are joined in Japanese as part of a set phrase, and are better off learned that way, rather than as individual words. This can be verbs that use other verbs as auxiliaries like 持ち上げる ‘to lift up’, or just things that are set patterns like 目を凝らす ‘to stare intently’.

In either case, due to them being better learned as a pattern, rather than as 持つ followed by 上げる in a deck, we would love users to point out any of these that you may notice are not joined correctly. Thankfully we already have many collocations and phrases in our vocab database, so this type of error will become far less common very quickly as we keep adding new words and phrases.

Although these are the most obvious ones, there are bound to be some other oddities that show up here and there.

Lastly, we can only link to words we already have in our database (about 30k at present), so words that show up in books but that we don’t have yet will be added as we go. If you see a word in the book, but not in the deck, please let us know so we can add it and update the deck.

Inconsistency of Example Sentences

One of the best things about the vocabulary database here on Bunpro is all of the natively written example sentences that are designed to give you lots of usage context about new words, at the same time as using a huge variety of the grammar that you will have learned on the site.

Despite this, as our book club decks will be created based on appearance in the book, rather than the actual level of the words, you are likely to encounter many words that do not have example sentences yet. In these cases, please just keep this in mind. If the vocab item is in a deck, you can rest assured that it is already in our queue to have example sentences written for it. It just might take a while depending on where the word is in the list that we follow.

Quality example sentences for all vocabulary that make active use of grammar that students are currently learning are a huge priority for us, so we will continue to get these out as quickly as we can.

Wrapping things up

We can’t wait to get this tool up to a level of accuracy that will enable us to instantly create decks from almost any text file, whether that be subtitle scripts, novels, light novels, you name it. We estimate that with each new book we parse, the number of new words and errors will decrease, so it is only a matter of time until it is at the high standard that we appreciate our users holding us to.

Book club vocab decks will come first, but it is highly likely that we will start adding decks for more and more books in the background as time goes on, half for the purpose of training our tool as quickly and efficiently as possible, and half for the purpose of wanting to make a wide array of popular titles available to our users in a timely manner.

Thanks again to everybody participating in the book clubs, and we hope that through working together, this will be yet another addition to our catalog of features that will help you achieve mastery of Japanese in record time!

Have a rockin day!

User requests/development plans (Ongoing Edit)

A list of things that users have requested to be able to do with our book decks, as well as with the wrapping tool in general.

Be able to see a list of possibilities when a word in the deck is ambiguous due to its appearance in the book (kana only when there are multiple possible kanji, or kanji only when there are multiple possible readings).
Link vocab entries that are also grammar points together so that progress is saved for people that have already studied the grammar.
Use the parser to make custom decks from user’s own content that they upload.
Be able to filter decks by JLPT level, etc.
Be able to track progress across books and display how many unknown words are in unread books.
Make suggestions for new books to read/decks to study based on previously studied book categories.

Marifly · March 29, 2024, 11:06am

This looks great! Thank you!

I think I found one of these on the first page. ヅャ-ヅをはいた - the deck has 吐く but it makes more sense if it’s 履く.

Really looking forward to reading the book with this!

Asher · March 29, 2024, 11:28am

This is exactly what we’re looking for! Thanks for pointing it out. In the future, we should be able to avoid such errors by tagging words with verbs that they’re commonly used with. For example, any noun with the ‘clothing’ tag would use verbs like 履く instead of 吐く.

Quick edit - You may find a few errors in the deck tonight! We didn’t announce the location of the decks yet because we were going to publish a lot of fixes tomorrow before the book clubs got started. Definitely let us know any weird kanji though. The fixes we are publishing are mainly missing words and misuses of phrases etc.
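To make the tagging idea concrete, here is a minimal sketch. The tables, tag names, and the `pick_verb` function are all made up for illustration and say nothing about how the Bunpro parser is actually built; it only shows how a noun tag could steer the choice between homophones such as 履く and 吐く.

```python
# Hypothetical data: tags attached to nouns, and the verb each tag prefers
# for a given kana spelling. None of this reflects Bunpro's real database.
NOUN_TAGS = {"ズボン": {"clothing"}, "靴": {"clothing"}, "息": {"breath"}}
PREFERRED_VERB = {("はく", "clothing"): "履く", ("はく", "breath"): "吐く"}
CANDIDATES = {"はく": ["吐く", "履く"]}


def pick_verb(object_noun, kana_verb):
    """Return (chosen verb, needs_review) for a kana-only verb and its object."""
    candidates = CANDIDATES.get(kana_verb, [kana_verb])
    for tag in sorted(NOUN_TAGS.get(object_noun, ())):
        preferred = PREFERRED_VERB.get((kana_verb, tag))
        if preferred in candidates:
            return preferred, False
    # No tag helped: fall back to the first candidate and ask for a human check
    # whenever more than one candidate exists.
    return candidates[0], len(candidates) > 1
```

With this toy data, `pick_verb("ズボン", "はく")` picks 履く, while an untagged noun falls back to the first candidate and is marked for review.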

Inounx · March 29, 2024, 11:47am

I am quickly going through the よつばと！ deck to see which words I don’t know, and this is what I saw:

それに in よつばと deck is marked “vocab” and different than the grammar point それに in N4 deck.
Idem for かもしれない (よつばと) and かもしれない (N4 deck)

You have probably already seen these, but I’m mentioning them in case it helps.

imsamuka · March 29, 2024, 11:46am

Since these problems are recognizable by the parser, have you added (or could you add) a highlight warning during the parsing process to find these beforehand? For low character count stories like manga, it could be quite easy to fix all of these, but for novels and books I don’t know how many would appear…

This could be a little similar to how git solves merging conflicts, for example.
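A rough sketch of what such a warning pass could look like, assuming a hypothetical mini-dictionary of (written form, reading) pairs; the `review_flags` name and the data layout are my own assumptions, not anything from the real parser. Every token with more than one candidate gets listed for a human to resolve, much like conflict markers.

```python
from collections import defaultdict

# Hypothetical mini-dictionary: (written form, reading) pairs.
ENTRIES = [
    ("避ける", "さける"), ("避ける", "よける"),
    ("修める", "おさめる"), ("収める", "おさめる"),
    ("治める", "おさめる"), ("納める", "おさめる"),
    ("国家", "こっか"),
]

by_surface = defaultdict(list)  # kanji form -> possible readings
by_reading = defaultdict(list)  # kana form  -> possible kanji forms
for written, reading in ENTRIES:
    by_surface[written].append(reading)
    by_reading[reading].append(written)


def review_flags(tokens):
    """Return (token, candidates) for every token with more than one option."""
    flagged = []
    for token in tokens:
        options = by_surface.get(token) or by_reading.get(token) or []
        if len(options) > 1:
            flagged.append((token, options))
    return flagged
```

Unambiguous words such as 国家 pass through silently, so the review list only grows with the number of genuinely ambiguous tokens in the text.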

Asher · March 29, 2024, 11:56am

We can find the common ones, and are in the process of writing lots of rules into our parser. Something to highlight words that could come out wrong is a good idea. Most of these are things that can be fixed very easily; they just need a human eye atm. Once we factor in word relationships for many of them, something like a highlight warning wouldn’t even be needed.

Asher · March 29, 2024, 12:00pm

They are both the same thing. There are some grammar points that we have on Bunpro that are also considered ‘vocab’ by many textbooks, and even Japanese dictionaries. We will probably just write a warning on these vocab items in the very near future so that people know that they’re the same. For the meantime, just consider them extra example sentences for the grammar.

Inounx · March 29, 2024, 12:04pm

No problem, I know this is still an early version. It just caught my attention because I noticed that other vocab words keep the SRS progression from one deck to another. But it was not the case for these ones, because they are of a different type.

Looking forward to seeing what these vocab decks will become in the end, this is very promising!

Mango1 · March 29, 2024, 2:48pm

This is very exciting :D. I’ve been waiting for there to be a tool like this to help with reading books and it just happened to be done here on Bunpro only about 2 weeks after I signed up. If this tool becomes something individual users can use to create our own custom decks, it would be really cool to test it out on some TV show dialogue and song lyrics.

wct · March 29, 2024, 4:02pm

You might want to ‘simplify’ things by having your tool recognize when there is ambiguity, and then ‘just’ providing multiple options for the user to choose from, rather than trying to have everything automagically selected by the tool itself.

I would recommend checking out how the browser plugin 10ten Japanese Reader handles the same/similar situation (or perhaps an equivalent one; perhaps Yomichan or whatever the latest is called; but I haven’t used it myself, so I don’t know if it works the same).

For example, with 避ける, when there are multiple possible readings/meanings for the same literal characters, 10ten will:

Merge similar readings (and their related nuanced meanings) into a single ‘dictionary entry’ containing all of the readings and meanings, leaving it to the user to differentiate which one is intended, such as (my annotations {like this} ):
- 避ける [さける, よける]
  (1) (v1,vt) to avoid (physical contact with) {typically よける}
  (2) (v1,vt) to avoid (situation); to evade (question, subject); to shirk (one’s responsibilities) {typically さける}
  (3) (v1) to ward off; to avert {either/both}
  (4) (v1,vt) to put aside; to move out of the way {typically よける}
Separate out significantly distinct meanings into distinct ‘dictionary entries’, and simply list all the matching dictionary entries to allow the user to select from them. A simple example is 様, with multiple readings/meanings. (NB: this would require, equivalently, on BP, that there be multiple distinct Vocabs with the same literal character string; there are indeed multiple such cases in existing BP Vocab, 様 being but one example.) 10ten simply matches them all, and presents them in a simple consecutive list form, like so:
- 様 [さま]
  (1) (suf) (hon) Mr; Mrs; Miss; Ms (after a person’s name, position, etc.)
  (2) (suf) (pol) makes a word more polite (usu. in fixed expressions) (usu. after a noun or na-adjective prefixed with お or ご)
  (3) (n) state; situation; appearance; manner
- 様 [よう]
  (1) (n-suf,n) (uk) appearing …; looking … (usu. after the -masu stem of a verb)
  (2) (n-suf,n) way to …; method of …ing (usu. after the -masu stem of a verb)
  (3) (n-suf,n) form; style; design (usu. after a noun)
  (4) (n-suf,n) like; similar to (usu. after a noun)
  (5) (n) thing (thought or spoken)
- 様, 態 [ざま, ザマ]
  (1) (n) (derog,uk) mess; sorry state; plight; sad sight
  (2) (suf) -ways; -wards (indicates direction)
  (3) (suf) in the act of …; just as one is … (after the -masu stem of a verb)
  (4) (suf) manner of …; way of … (after the -masu stem of a verb)
- 例, 様, 例し [ためし]
  (n) precedent; example
In the case of おさめる, 10ten does a mixture of these two approaches. For example, it merges 収める and 納める, since there’s a lot of overlap there (but basically because that’s how Jisho/JMdict handles it, considering them both as different possible kanji for the same essential ‘dictionary entry’), but has distinct ‘dictionary entries’ for 修める and also for 治める. Again, it simply matches all of them and then lists them all in a row, allowing the user to make the choice between them.
- 収める, 納める [おさめる] { merged }
  (1) (v1,vt) to put (into); to put away (in); to put back (in); to keep (in); to store (in); to restore (to its place)
  (2) (v1,vt) to include (in an anthology, catalogue, etc.); to contain; to publish (in); to capture (on film)
  (3) (v1,vt) to achieve (results, success, etc.); to obtain; to get; to gain; to win; to make (a profit)
  (4) (v1,vt) to pay (fees, taxes, etc.); to deliver; to supply
  (5) (v1,vt) to accept (a gift or money)
  (6) (v1,vt) to keep (within a limit)
  (7) (v1,vt) to offer (to a shrine, deity, etc.); to dedicate
  (8) (v1,vt) to subdue; to suppress; to settle
  (9) (v1,vt,suf) to finish; to conclude; to wind up; to bring to a close
- 修める [おさめる] { distinct }
  (1) (v1,vt) to study; to complete (a course); to cultivate; to master
  (2) (v1,vt) to order (one’s life)
  (3) (v1,vt) to repair (a fault one has committed)
- 治める [おさめる] { distinct }
  (1) (v1,vt) to rule; to govern; to reign over; to administer; to manage (e.g. a household)
  (2) (v1,vt) to subdue; to suppress; to quell; to settle (e.g. a dispute)
As for things like set-phrases, which are often composed of smaller, individual dictionary entries, 10ten ‘solves’ this issue by trying to match as greedily as possible (to match as much of the text as possible, given the entries it has available to match with), and so it will always give you a long set-phrase as the first result, if such a match is possible. However, 10ten also will match any smaller entries that appear under the cursor. It then, once again, simply lists all the matches together, from ‘greediest’/largest match to smallest. For example, for 持ち上げる, it provides this list:
- 持ち上げる, 持上げる [もちあげる]
  (1) (v1,vt) to elevate; to raise; to lift up
  (2) (v1,vt) to flatter; to extol; to praise to the sky
- 持ち [もち]
  (1) (n,n-suf) having; holding; possessing; owning; using; holder; owner; user
  (2) (n,n-suf) wear; durability; life (also written as 保ち)
  (3) (n,n-suf) charge; expense
  (4) (n) (form) draw (in go, poetry contest, etc.); tie
- … (It actually continues to ‘drill down’ in the match with entries such as 持 [じ]; but it also does other ‘smart’ matching such as recognizing the root verb from a ‘conjugation’, so it also lists 持つ, and marks it as being related to 持ち, because 持ち is the ‘masu stem’; you really should check out this plugin (or a similar one) for inspiration, IMHO)
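To illustrate the greedy matching described above, here is a small sketch. It is not 10ten’s actual code, and the `LEXICON` set and function names are assumptions for the example. It lists every dictionary entry starting at a position, longest first, and uses the longest one to segment the text. It does not attempt the conjugation handling mentioned above (e.g. relating 持ち to 持つ).

```python
# Hypothetical dictionary of known entries.
LEXICON = {"持ち上げる", "持ち", "持", "上げる", "目を凝らす", "目"}


def matches_at(text, start, lexicon=LEXICON):
    """All dictionary entries that begin at `start`, longest first."""
    return [
        text[start:end]
        for end in range(len(text), start, -1)
        if text[start:end] in lexicon
    ]


def segment(text, lexicon=LEXICON):
    """Greedy longest-match segmentation; unknown characters pass through singly."""
    pos, pieces = 0, []
    while pos < len(text):
        found = matches_at(text, pos, lexicon)
        piece = found[0] if found else text[pos]
        pieces.append(piece)
        pos += len(piece)
    return pieces
```

Here `matches_at("持ち上げる", 0)` gives the full set phrase first, followed by 持ち and 持, which mirrors the ‘greediest to smallest’ list shown above.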

Obviously, this is only a workaround. There is usually only one ‘correct’, ‘true’ meaning/nuance implied by a given text, and thus also usually only one ‘correct’ kanji/reading (even if the literal word is written with kana only).

But it’s a useful workaround in the ‘Pareto’ sense that if you could get this working, then you’ve gotten 80% of the usability that you want with only 20% of the effort.

It at least gives the user access to the correct meaning/nuance/reading/kanji, although it does leave that final ‘disambiguation’ up to the user.

And I guarantee you that it will be much easier to implement such a tool than a completely ‘automagic’ tool that correctly guesses everything on its own.

Even if you merely use this kind of 10ten-inspired interface behind-the-scenes at BP to help you guys quickly and efficiently select the correct option out of the ambiguities, rather than deferring that choice to the end-users (us!), it would save you guys a lot of time, hassle, and human-error / typo related bugs.

Also, by developing this kind of half-way interface, you guys will gain a lot of understanding and knowledge of what it will take to finally go ‘all the way’ towards a ‘fire and forget’ automagic tool that would do most/all of this additional selection for you. So, trying this first wouldn’t actually be a detour, but more like ‘laying the groundwork’ for the full tool, if you end up putting in the time/effort to implement the full tool in the end.

I could say a ton more, but that’s good for now. Hope this idea helps somewhat!

Pablunpro · March 30, 2024, 12:03pm

Hi!

This might be an obvious thing, but, as far as I’ve seen, words in the decks appear in the text sequencing order, i.e. in the order of appearance within the book.

This is quite useful whilst reading, as you can read along the vocab deck quite smoothly.

Thought it worth mentioning.
Have a nice day!

Asher · March 30, 2024, 12:28pm

Yep, it’s intentional that they appear in the sequence of the book, as we assumed that would be the order that most people would naturally learn them.
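As a tiny illustration of that ordering, assuming a deck is just a list of words and each chapter arrives as a list of parsed tokens (both assumptions for this sketch, as is the `extend_deck` name), adding a new chapter while keeping first-appearance order could look like this:

```python
def extend_deck(deck, chapter_tokens):
    """Append words from a new chapter in order of first appearance.

    Words already in the deck keep their original position and are not repeated.
    """
    seen = set(deck)
    for word in chapter_tokens:
        if word not in seen:
            seen.add(word)
            deck.append(word)
    return deck
```

Calling it once per weekly chapter keeps earlier words where they were and adds only the new ones at the end.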

Asher · March 30, 2024, 1:01pm

As we have been working on this tool for quite some time in the background, we already have plans of attack for most of the potential problems that you have expressed here. Rather than being magic, almost everything is figured out through probability systems that we program in ourselves, and will continue to program in until we get it as close to perfect as possible.

For example, with the さける/よける case, we can set up rules that check the surrounding vocabulary in order to decide the likely reading. If the surrounding words contain something like 攻撃, 拳, 突撃, 一発, パンチ, or really anything that indicates a ‘physical attack’, then we could prioritize よける being chosen.
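A simplified sketch of that kind of rule follows. The cue words are the ones listed above; the window size, the fallback to さける, and the function name are assumptions made for the example, not the actual probability system.

```python
# Cue words taken from the list above; everything else here is an assumption.
ATTACK_CUES = {"攻撃", "拳", "突撃", "一発", "パンチ"}


def guess_reading(tokens, index, window=5):
    """Guess the reading of 避ける at tokens[index] from nearby words."""
    lo, hi = max(0, index - window), index + window + 1
    nearby = tokens[lo:index] + tokens[index + 1:hi]
    hits = sum(1 for token in nearby if token in ATTACK_CUES)
    # Any 'physical attack' cue in the window tips the choice towards よける.
    return "よける" if hits > 0 else "さける"
```

A real system would presumably weight cues rather than count them, but even this version separates パンチを避ける from 質問を避ける.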

However, we will be setting up a reporting system on words that can be read in different ways, as there is only so close we can get to perfect. Naturally there will always be situations where an author intentionally uses a word in an ‘unconventional’ way to get a point across, or even as a joke.

This one is already on our list of things to implement. On our back end, we have separated out all words that have the same reading (for verb families), and will be creating linked vocab trees in a similar fashion to the way that synonyms have always been presented on Bunpro grammar. For example, when learning 収める, the entries of 納める, 治める, 修める will appear as synonyms, and be linked as their own individual vocab items.

As for which words we are doing this for and which we aren’t, there is actually a technical term for them in Japanese: 同訓異字. We will be building synonym trees for any word that is 同訓異字, as these actually are synonyms in the sense that their meanings overlap at least to some degree.

We already have this set up for grammar detection, and for vocab as well. It is definitely the best way to catch phrases. We will also be setting up phrase recognition when particles are omitted, such as when 気が付いた is said as 気付いた.

We know that we can’t get it perfect. But we can get it to something above 95% accuracy with intense training, which is what we intend to do.

At the moment, we’re actually fairly close to implementing a lot of fixes in the system that will get this working with a very high level of accuracy. However, we are definitely in the stage at the moment of wanting our users to be very strict with what the system itself spits out for decks. As you state yourself, the more that we can understand about the errors, the easier it will be for us to develop onward toward a ‘near’ perfect tool.

Asher · March 30, 2024, 1:04pm

When we do get the tool up to the desired standard, we would love for users to be able to use it for their own purposes, like for the type of content you mentioned. Actually having this tool available to users themselves though is something that we would have to figure out logistically, so may take some time. Fingers crossed!

Orock45 · March 31, 2024, 2:59am

Hi there! I’m going through the vocabulary deck for スーパーカブ and wondering if I can filter it. It’d be useful to filter words I’ve learned versus those I haven’t and by JLPT level. Also, an option to mark all JLPT N5 words as known would be great. Plus, having a setting to focus on learning vocab above a certain JLPT level, like N4 or higher, would be helpful.

With jpdb and the ttsu reader app and jpdb-reader extension for Chrome, you can see which words in the text you know and don’t know. It gives a quick overview of what’s left to learn. It would be really nice to have a similar function in bunpro.
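For what it’s worth, the filtering I have in mind could be sketched roughly like this. The `DeckItem` shape, the function names, and the JLPT levels in the usage example are placeholders I made up, not Bunpro’s data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeckItem:
    word: str
    jlpt: Optional[int]  # 5 for N5 ... 1 for N1; None if the word has no level
    known: bool = False


def mark_level_known(deck, level):
    """Mark every item of the given JLPT level (e.g. 5 for N5) as known."""
    for item in deck:
        if item.jlpt == level:
            item.known = True


def to_study(deck, easiest_level=4):
    """Unknown items at `easiest_level` or harder; items without a level are kept."""
    return [
        item for item in deck
        if not item.known and (item.jlpt is None or item.jlpt <= easiest_level)
    ]


def unknown_count(deck):
    """How many words in the deck are still left to learn."""
    return sum(1 for item in deck if not item.known)
```

Marking N5 as known and then calling `to_study(deck)` would leave only the unknown N4-and-above words, and `unknown_count` gives the quick ‘what’s left’ overview.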

Asher · March 31, 2024, 3:33am

Will keep a log of requests for deck capability at the top of this post from things that users mention here and in other threads so that our developers are able to get on top of as many of them as soon as possible.

@Orock45 will add yours to it too now.

Marifly · March 31, 2024, 1:39pm

Only 2 of the vocab decks are available in the app (Yotsuba is missing). For the two decks that are available, the number of items in the decks is different between the app and the web site.

Asher · March 31, 2024, 1:41pm

For app related issues, @mrnoone will be able to give you the quickest fixes.

drunkgome · April 1, 2024, 3:42am

For the Star Wars deck, the novel says 「時空を超越した偉大な国家であった」
The word after 国家 in the deck is 出会う. I might be wrong, but it seems like it should be である instead.

In the future, should we continue posting any possible errors here or should we post them in the related book club thread?

Asher · April 1, 2024, 3:50am

Posting them in the thread, or just PMing them to me directly is also fine. Thanks for the report! Will fix now.