Since the parser can recognize these problems, have you added (or could you add) a highlight warning during parsing to catch them beforehand? For low-character-count stories like manga, it could be quite easy to fix all of these, but for novels and books I don’t know how many would appear…
We can find the common ones, and we are in the process of writing lots of rules into our parser. Something to highlight words that could come out wrong is a good idea. Most of these are things that can be fixed very easily; they just need a human eye at the moment. Once we factor in word relationships for many of them, something like a highlight warning wouldn’t even be needed.
They are both the same thing. There are some grammar points that we have on Bunpro that are also considered ‘vocab’ by many textbooks, and even by Japanese dictionaries. We will probably just write a warning on these vocab items in the very near future so that people know that they’re the same. For the time being, just consider them extra example sentences for the grammar.
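To make the highlight-warning idea concrete, here is a minimal sketch of flagging tokens that have more than one known reading so a human can check them. The `KNOWN_READINGS` table and the `flag_ambiguous` function are hypothetical names made up for illustration; this is not how the Bunpro parser actually works, and it assumes the text has already been split into tokens.

```python
# Hypothetical reading table; a real parser would draw on a full dictionary.
KNOWN_READINGS = {
    "避ける": ["さける", "よける"],
    "様": ["さま", "よう", "ざま"],
    "国家": ["こっか"],
}


def flag_ambiguous(tokens):
    """Return (index, token, readings) for every token with several known readings."""
    flagged = []
    for index, token in enumerate(tokens):
        readings = KNOWN_READINGS.get(token, [])
        if len(readings) > 1:
            flagged.append((index, token, readings))
    return flagged
```

Anything returned by `flag_ambiguous` could be highlighted for review, while tokens with a single reading pass through untouched.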
No problem, I know this is still an early version. It just caught my attention because I noticed that other vocab words keep their SRS progression from one deck to another. But that was not the case for these ones, because they are of a different type.
Looking forward to seeing what these vocab decks will be in the end; this is very promising!
This is very exciting :D. I’ve been waiting for there to be a tool like this to help with reading books and it just happened to be done here on Bunpro only about 2 weeks after I signed up. If this tool becomes something individual users can use to create our own custom decks, it would be really cool to test it out on some TV show dialogue and song lyrics.
You might want to ‘simplify’ things by having your tool recognize when there is ambiguity, and then ‘just’ providing multiple options for the user to choose from, rather than trying to have everything automagically selected by the tool itself.
I would recommend checking out how the browser plugin 10ten Japanese Reader handles the same/similar situation (or perhaps an equivalent one; perhaps Yomichan or whatever the latest is called; but I haven’t used it myself, so I don’t know if it works the same).
For example, with 避ける, when there are multiple possible readings/meanings for the same literal characters, 10ten will:
Merge similar readings (and their related nuanced meanings) into a single ‘dictionary entry’ containing all of the readings and meanings, leaving it to the user to differentiate which one is intended, such as (my annotations {like this} ):
避ける [さける, よける]
(1) (v1,vt) to avoid (physical contact with) {typically よける}
(2) (v1,vt) to avoid (situation); to evade (question, subject); to shirk (one’s responsibilities) {typically さける}
(3) (v1) to ward off; to avert {either/both}
(4) (v1,vt) to put aside; to move out of the way {typically よける}
Separate out significantly distinct meanings into distinct ‘dictionary entries’, and simply list all the matching dictionary entries to allow the user to select from them. A simple example is 様, with multiple readings/meanings. (NB: this would require, equivalently, on BP, that there be multiple distinct Vocabs with the same literal character string; there are indeed multiple such cases in existing BP Vocab, 様 being but one example.) 10ten simply matches them all, and presents them in a simple consecutive list form, like so:
様 [さま]
(1) (suf) (hon) Mr; Mrs; Miss; Ms (after a person’s name, position, etc.)
(2) (suf) (pol) makes a word more polite (usu. in fixed expressions) (usu. after a noun or na-adjective prefixed with お or ご)
(3) (n) state; situation; appearance; manner
様 [よう]
(1) (n-suf,n) (uk) appearing …; looking … (usu. after the -masu stem of a verb)
(2) (n-suf,n) way to …; method of …ing (usu. after the -masu stem of a verb)
(3) (n-suf,n) form; style; design (usu. after a noun)
(4) (n-suf,n) like; similar to (usu. after a noun)
(5) (n) thing (thought or spoken)
様, 態 [ざま, ザマ]
(1) (n) (derog,uk) mess; sorry state; plight; sad sight
(2) (suf) -ways; -wards (indicates direction)
(3) (suf) in the act of …; just as one is … (after the -masu stem of a verb)
(4) (suf) manner of …; way of … (after the -masu stem of a verb)
例, 様, 例し [ためし]
(n) precedent; example
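As a rough sketch of the two approaches above (merged readings inside one entry, and several distinct entries for the same characters), the lookup could simply return every entry whose headwords include the surface form and leave the choice to the user. The `Entry` class, the `ENTRIES` list, and `candidates` are names I made up for illustration, with glosses shortened from the lists above; I am not claiming this is how 10ten is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    headwords: tuple  # all kanji spellings that share this entry
    readings: tuple   # similar readings merged into one entry
    glosses: tuple    # shortened meanings, for illustration only


# Hypothetical mini-dictionary built from the examples in this post.
ENTRIES = [
    Entry(("避ける",), ("さける", "よける"), ("to avoid", "to ward off")),
    Entry(("様",), ("さま",), ("Mr; Mrs; Miss; Ms", "state; situation")),
    Entry(("様",), ("よう",), ("appearing; looking", "way to; method of")),
    Entry(("様", "態"), ("ざま", "ザマ"), ("mess; sorry state",)),
    Entry(("例", "様", "例し"), ("ためし",), ("precedent; example",)),
]


def candidates(surface):
    """Return every entry that could match the surface form, in dictionary order."""
    return [entry for entry in ENTRIES if surface in entry.headwords]
```

With this data, `candidates("様")` gives four separate entries to pick from, while `candidates("避ける")` gives one entry that carries both readings.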
In the case of おさめる, 10ten does a mixture of these two approaches. For example, it merges 収める and 納める, since there’s a lot of overlap there (but basically because that’s how Jisho/JMdict handles it, considering them both as different possible kanji for the same essential ‘dictionary entry’), but has distinct ‘dictionary entries’ for 修める and also for 治める. Again, it simply matches all of them and then lists them all in a row, allowing the user to make the choice between them.
収める, 納める [おさめる] { merged }
(1) (v1,vt) to put (into); to put away (in); to put back (in); to keep (in); to store (in); to restore (to its place)
(2) (v1,vt) to include (in an anthology, catalogue, etc.); to contain; to publish (in); to capture (on film)
(3) (v1,vt) to achieve (results, success, etc.); to obtain; to get; to gain; to win; to make (a profit)
(4) (v1,vt) to pay (fees, taxes, etc.); to deliver; to supply
(5) (v1,vt) to accept (a gift or money)
(6) (v1,vt) to keep (within a limit)
(7) (v1,vt) to offer (to a shrine, deity, etc.); to dedicate
(8) (v1,vt) to subdue; to suppress; to settle
(9) (v1,vt,suf) to finish; to conclude; to wind up; to bring to a close
修める [おさめる] { distinct }
(1) (v1,vt) to study; to complete (a course); to cultivate; to master
(2) (v1,vt) to order (one’s life)
(3) (v1,vt) to repair (a fault one has committed)
治める [おさめる] { distinct }
(1) (v1,vt) to rule; to govern; to reign over; to administer; to manage (e.g. a household)
(2) (v1,vt) to subdue; to suppress; to quell; to settle (e.g. a dispute)
As for things like set-phrases, which are often composed of smaller, individual dictionary entries, 10ten ‘solves’ this issue by trying to match as greedily as possible (to match as much of the text as possible, given the entries it has available to match with), and so it will always give you a long set-phrase as the first result, if such a match is possible. However, 10ten also will match any smaller entries that appear under the cursor. It then, once again, simply lists all the matches together, from ‘greediest’/largest match to smallest. For example, for 持ち上げる, it provides this list:
持ち上げる, 持上げる [もちあげる]
(1) (v1,vt) to elevate; to raise; to lift up
(2) (v1,vt) to flatter; to extol; to praise to the sky
持ち [もち]
(1) (n,n-suf) having; holding; possessing; owning; using; holder; owner; user
(2) (n,n-suf) wear; durability; life (also written as 保ち)
(3) (n,n-suf) charge; expense
(4) (n) (form) draw (in go, poetry contest, etc.); tie
… (It actually continues to ‘drill down’ in the match with entries such as 持 [じ]. It also does other ‘smart’ matching, such as recognizing the root verb from a ‘conjugation’, so it also lists 持つ and marks it as being related to 持ち, because 持ち is the ‘masu stem’. You really should check out this plugin (or a similar one) for inspiration, IMHO.)
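The greedy matching described above can be sketched as follows: starting at the cursor position, try the longest possible substring first, then progressively shorter ones, and collect every dictionary hit in that order. `DICTIONARY` and `matches_at` are hypothetical names for illustration, and the sketch deliberately leaves out the conjugation handling (it would not find 持つ from 持ち).

```python
# Hypothetical lookup table: surface form -> reading.
DICTIONARY = {
    "持ち上げる": "もちあげる",
    "持ち": "もち",
    "持": "じ",
}


def matches_at(text, start, max_len=8):
    """List all dictionary matches beginning at `start`, longest match first."""
    found = []
    longest_end = min(len(text), start + max_len)
    # Walk from the longest candidate substring down to a single character.
    for end in range(longest_end, start, -1):
        candidate = text[start:end]
        if candidate in DICTIONARY:
            found.append((candidate, DICTIONARY[candidate]))
    return found
```

For the text 荷物を持ち上げる with the cursor on 持 (index 3), this returns 持ち上げる first, then 持ち, then 持.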
Obviously, this is only a workaround. There is usually only one ‘correct’, ‘true’ meaning/nuance implied by a given text, and thus also usually only one ‘correct’ kanji/reading (even if the literal word is written with kana only).
But it’s a useful workaround in the ‘Pareto’ sense that if you could get this working, then you’ve gotten 80% of the usability that you want with only 20% of the effort.
It at least gives the user access to the correct meaning/nuance/reading/kanji, although it does leave that final ‘disambiguation’ up to the user.
And I guarantee you that it will be much easier to implement such a tool than a completely ‘automagic’ tool that correctly guesses everything on its own.
Even if you merely use this kind of 10ten-inspired interface behind-the-scenes at BP to help you guys quickly and efficiently select the correct option out of the ambiguities, rather than deferring that choice to the end-users (us!), it would save you guys a lot of time, hassle, and human-error / typo related bugs.
Also, by developing this kind of half-way interface, you guys will gain a lot of understanding and knowledge of what it will take to finally go ‘all the way’ towards a ‘fire and forget’ automagic tool that would do most/all of this additional selection for you. So, trying this first wouldn’t actually be a detour, but more like ‘laying the groundwork’ for the full tool, if you end up putting in the time/effort to implement the full tool in the end.
I could say a ton more, but that’s good for now. Hope this idea helps somewhat!
This might be an obvious thing, but, as far as I’ve seen, words in the decks appear in the text sequencing order, i.e. in the order of appearance within the book.
This is quite useful whilst reading, as you can read along the vocab deck quite smoothly.
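For what it’s worth, that kind of ordering is easy to get: keep each word the first time it appears and drop later repeats. This is just a sketch of the idea with a made-up function name (`deck_order`), assuming the text has already been tokenized; I don’t know how the decks are actually built.

```python
def deck_order(tokens):
    """Deduplicate tokens while keeping the order of first appearance."""
    # dict preserves insertion order, so repeated tokens keep their first position.
    return list(dict.fromkeys(tokens))
```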
As we have been working on this tool for quite some time in the background, we already have plans of attack for most of the potential problems that you have expressed here. Rather than being magic, almost everything is figured out through probability systems that we program in ourselves, and will continue to program in until we get it as close to perfect as possible.
Taking the さける/よける case, we can set up rules that check the surrounding vocabulary in order to decide the likely reading. For example, if the surrounding words contain something like 攻撃, 拳, 突撃, 一発, パンチ, or really anything that indicates a ‘physical attack’, then we could prioritize よける being chosen.
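As a minimal sketch of that kind of context check: count how many of the surrounding words are ‘physical attack’ cues and pick よける if any are present. The names `PHYSICAL_ATTACK_CUES` and `choose_reading` are hypothetical, the fallback to さける is an assumption made for the example, and a real probability system would weigh many more signals than a single keyword list.

```python
# Cue words taken from the example above; a real list would be much longer.
PHYSICAL_ATTACK_CUES = {"攻撃", "拳", "突撃", "一発", "パンチ"}


def choose_reading(surrounding_words, default="さける"):
    """Pick a reading for 避ける based on nearby 'physical attack' vocabulary."""
    hits = sum(1 for word in surrounding_words if word in PHYSICAL_ATTACK_CUES)
    # Assumption: any cue at all tips the choice toward よける.
    return "よける" if hits > 0 else default
```

So `choose_reading(["敵", "の", "パンチ", "を"])` picks よける, while a context with no cue words falls back to the default.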
However, we will be setting up a reporting system on words that can be read in different ways, as there is only so close we can get to perfect. Naturally there will always be situations where an author intentionally uses a word in an ‘unconventional’ way to get a point across, or even as a joke.
This one is already on our list of things to implement. On our back end, we have separated out all words that have the same reading (for verb families), and will be creating linked vocab trees in a similar fashion to the way that synonyms have always been presented on Bunpro grammar. For example, when learning 収める, the entries of 納める, 治める, 修める will appear as synonyms, and be linked as their own individual vocab items.
There is actually a technical term in Japanese for the cases we are doing this for: 同訓異字. For any words that are 同訓異字, we will be building synonym trees, as they actually are synonyms in the sense that their meanings are the same in at least some respect.
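A rough sketch of how such links could be derived: group vocab items by reading, then point each word at the others in its group. `VOCAB` and `build_links` are hypothetical names for illustration, not our actual back end. Note that a shared reading alone is not enough to call two words 同訓異字 (unrelated homophones would be grouped too), so a real version would need an extra check on meaning.

```python
from collections import defaultdict

# Hypothetical (word, reading) pairs for illustration.
VOCAB = [
    ("収める", "おさめる"),
    ("納める", "おさめる"),
    ("治める", "おさめる"),
    ("修める", "おさめる"),
    ("避ける", "さける"),
]


def build_links(vocab):
    """Map each word to the other words that share its reading."""
    by_reading = defaultdict(list)
    for word, reading in vocab:
        by_reading[reading].append(word)
    links = {}
    for words in by_reading.values():
        for word in words:
            links[word] = [other for other in words if other != word]
    return links
```

With this data, 収める links to 納める, 治める, and 修める, and each of those links back in the same way.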
We already have this set up for grammar detection, and for vocab as well. It is definitely the best way to catch phrases. We will also be setting up phrase recognition when particles are omitted, such as when 気が付いた is said as 気付いた.
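To illustrate the particle-omission case, here is a minimal sketch that treats the particle as optional when matching a phrase. `PHRASES` and `find_phrases` are hypothetical names, and matching on a bare verb stem with a regular expression is a crude stand-in for proper morphological analysis, used here only to show the idea.

```python
import re

# Hypothetical phrase table: (noun, droppable particle, verb stem) -> dictionary form.
PHRASES = {
    ("気", "が", "付"): "気が付く",
}


def find_phrases(text):
    """Return (position, dictionary form) for each phrase found, particle optional."""
    found = []
    for (noun, particle, stem), lemma in PHRASES.items():
        pattern = re.compile(
            re.escape(noun) + "(?:" + re.escape(particle) + ")?" + re.escape(stem)
        )
        for match in pattern.finditer(text):
            found.append((match.start(), lemma))
    return found
```

Both やっと気が付いた and やっと気付いた resolve to 気が付く, while something like 天気がいい is left alone because the verb stem never follows.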
We know that we can’t get it perfect. But we can get it to something above 95% accuracy with intense training, which is what we intend to do.
At the moment, we’re actually fairly close to implementing a lot of fixes in the system that will get this working with a very high level of accuracy. However, we are definitely at the stage where we want our users to be very strict about what the system itself spits out for decks. As you state yourself, the more that we can understand about the errors, the easier it will be for us to develop onward toward a ‘near’ perfect tool.
When we do get the tool up to the desired standard, we would love for users to be able to use it for their own purposes, like for the type of content you mentioned. Actually making this tool available to users themselves, though, is something that we would have to figure out logistically, so it may take some time. Fingers crossed!
Hi there! I’m going through the vocabulary deck for スーパーカブ and wondering if I can filter it. It’d be useful to filter words I’ve learned versus those I haven’t and by JLPT level. Also, an option to mark all JLPT N5 words as known would be great. Plus, having a setting to focus on learning vocab above a certain JLPT level, like N4 or higher, would be helpful.
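To be clear about what I mean, here is a small sketch of the filtering I have in mind. The `Card` class and both functions are made up for illustration and have nothing to do with Bunpro’s real data model; the JLPT levels on the sample words are illustrative and not checked against any official list.

```python
from dataclasses import dataclass


@dataclass
class Card:
    word: str
    jlpt: int            # 5 means N5 (easiest), 1 means N1 (hardest)
    known: bool = False


def mark_level_known(deck, level):
    """Mark every card of the given JLPT level as already known."""
    for card in deck:
        if card.jlpt == level:
            card.known = True


def to_study(deck, easiest_level=4):
    """Return unknown cards at the given level or harder (N4 or higher by default)."""
    return [card for card in deck if not card.known and card.jlpt <= easiest_level]


# Illustrative levels only.
deck = [Card("食べる", 5), Card("届ける", 4), Card("超越", 1)]
mark_level_known(deck, 5)
remaining = [card.word for card in to_study(deck, easiest_level=4)]
```

After marking N5 as known, `remaining` holds only the N4-and-above words that are still unlearned.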
With jpdb and the ttsu reader app and jpdb-reader extension for Chrome, you can see which words in the text you know and don’t know. It gives a quick overview of what’s left to learn. It would be really nice to have a similar function in bunpro.
We will keep a log of requests for deck capability at the top of this post, based on things that users mention here and in other threads, so that our developers are able to get on top of as many of them as possible, as soon as possible.
Only 2 of the vocab decks are available in the app (Yotsuba is missing). For the two decks that are available, the number of items in the decks is different between the app and the web site.
For the Star Wars deck, the novel says 「時空を超越した偉大な国家であった」
The word after 国家 in the deck is 出会う. I might be wrong, but it seems like it should be である instead.
In the future, should we continue posting any possible errors here, or should we post them in the related book club thread?
I’ll send you a PM with a full list of the ones I find after having read as far as my free e-book trial allows me to. For now, I’ve found another one. 「この国がどこにあったのか,…」should most likely just be the past tense of ある, but shows up as 似合う in the deck.
Is the tool capable of handling set phrases and compound verbs?
I’ve just encountered ‘手をつける’ in よつば (page 14, bottom right-hand frame). In スーパーカブ there are quite a few compound verbs that show up as components in the vocab deck (盛り足す or 切り崩す come to mind).
It sure can. We just need to add ones that are missing. Some of them are not actual sayings, while some of them are. For example 盛り足す is not a standard saying, it’s just something the author has made up to add emphasis. In most cases, we’ll only add the ones that are actual phrases to the decks (to save people remembering something which is not normal Japanese).