Community Decks (Beta) - Nov 19th 2024

DeclanF · November 22, 2024, 2:39am

I looked into seeing if I could get game data into custom decks; however, the issue is more figuring out where spaces are inside Japanese sentences than extracting the text. If anyone has any tools for extracting words from text, or even if Bunpro could use the massive vocabulary data they have to allow importing from plain text, I would appreciate it.

Flutter · November 22, 2024, 8:49am

I found the best way is to use MeCab along with a dictionary. You can look at tokenizer.py in the repository I posted in this thread a few posts ago. If you’re using Python, I think you could even just import my package and use the method directly to turn a list of strings into a set of vocab items.
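
For readers who have not used MeCab before, here is a minimal Python sketch of the general idea (sentences in, set of dictionary forms out). It is not the tokenizer.py from the repository mentioned above. It assumes the fugashi wrapper around MeCab with a UniDic dictionary installed, and the names mecab_tokenize and extract_vocab are made up for this illustration.

```python
from functools import lru_cache
from typing import Callable, Iterable, List, Set, Tuple

# A tokenizer takes a sentence and returns (surface, dictionary_form) pairs.
Tokenizer = Callable[[str], List[Tuple[str, str]]]


@lru_cache(maxsize=1)
def _get_tagger():
    """Create the MeCab tagger once (assumes fugashi and a UniDic dictionary)."""
    import fugashi  # imported lazily so the rest of the sketch runs without it

    return fugashi.Tagger()


def mecab_tokenize(sentence: str) -> List[Tuple[str, str]]:
    """Tokenize one sentence with MeCab via fugashi."""
    # With UniDic, feature.lemma holds the dictionary form; fall back to the
    # surface text when the dictionary has no lemma for a token.
    return [
        (word.surface, word.feature.lemma or word.surface)
        for word in _get_tagger()(sentence)
    ]


def extract_vocab(lines: Iterable[str], tokenize: Tokenizer) -> Set[str]:
    """Turn a list of sentences into a set of dictionary-form vocab items."""
    vocab: Set[str] = set()
    for line in lines:
        for _surface, lemma in tokenize(line.strip()):
            if lemma:
                vocab.add(lemma)
    return vocab
```

Calling extract_vocab(sentences, mecab_tokenize) would use MeCab. Passing the tokenizer as an argument is only a design choice for this sketch, so that a different tokenizer or dictionary can be swapped in.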

danyramdas · November 22, 2024, 10:36am

I’m slowly working on this. I use the vetted list at jpdb.io as the source.
Currently 4 chapters done.

nemuikamu · November 22, 2024, 8:57pm

this looks amazing!
I’m excited to get decks related to manga/anime/movies to help me study phrases and vocab that’s related specifically to the content

in particular, I’d like to see ones for Solo Leveling and Perfect Days

DeclanF · November 23, 2024, 2:21am

Thanks! I went a little crazy and made a tab-separated CSV file for myself of the vocabulary from Yokai Watch 1, since I’ve heard that game is good for beginners (and it’s fun).

github.com

Declan-F/Shape-Lagging-Elsewhere/blob/main/finaltext.csv

さ	394
ケータ	31
君	245
で	2131
は	3731
妖怪	837
探し	14
と	1845
参る	11
ます	1888
か	2882
うん	221
今日	199
何処	165
に	4454
行く	837
至る	3
所	230
潜む	15
て	5387

This file has been truncated.
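
A file in this shape (one word and one count per line, separated by a tab) is easy to load and sort by frequency. The following is a small sketch assuming exactly that two-column layout; the function name load_frequencies is hypothetical and the sample rows are copied from the preview above.

```python
import csv
import io
from typing import Iterable, List, Tuple


def load_frequencies(rows: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse 'word<TAB>count' rows and return them sorted by count, highest first."""
    entries = []
    for row in csv.reader(rows, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        word, count = row
        entries.append((word, int(count)))
    # sorted() is stable, so words with equal counts keep their file order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# First five rows of the preview, used here as sample input.
sample = "さ\t394\nケータ\t31\n君\t245\nで\t2131\nは\t3731\n"
top = load_frequencies(io.StringIO(sample))
print(top[:3])
```

To read the real file instead of the sample, an open file object can be passed in place of the io.StringIO wrapper.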

AndrewA · November 23, 2024, 3:10am

This is cool! Someone should make Japanese from Zero decks (someone might be me).

danyramdas · November 23, 2024, 3:14am

Word of caution: I’m not sure how you will handle revisions of the books, but it’s probably something to be aware of.

veritas_nz · November 23, 2024, 5:14am

@Flutter

The import tool will only be for importing Vocab.
It will only allow one instance of a “query” string (a line of Japanese text in the CSV), so it would remove duplicate Vocab from the list before searching for them.

After parsing everything, it will search for the items in our Vocab DB.
Anything it can’t find will be added to a separate “overflow” (broken item) list for manual-searching/deletion.
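
The flow described above (deduplicate, look up, collect misses) can be sketched roughly as follows. This is only an illustration, not Bunpro's implementation: the function name import_vocab is hypothetical, and a plain dict with made-up IDs stands in for the Vocab DB.

```python
from typing import Dict, Iterable, List, Tuple


def import_vocab(
    query_lines: Iterable[str], vocab_db: Dict[str, int]
) -> Tuple[List[int], List[str]]:
    """Return (matched item IDs, overflow list) for the given query strings."""
    # dict.fromkeys keeps the first occurrence of each query and preserves order.
    queries = list(
        dict.fromkeys(line.strip() for line in query_lines if line.strip())
    )
    matched: List[int] = []
    overflow: List[str] = []
    for query in queries:
        if query in vocab_db:
            matched.append(vocab_db[query])
        else:
            overflow.append(query)  # left for manual searching or deletion
    return matched, overflow


# Stand-in for the Vocab DB; the IDs are invented for this example.
fake_db = {"林檎": 1, "犬": 2, "猫": 3}
matched, overflow = import_vocab(["林檎", "犬", "林檎", "ねこまた", ""], fake_db)
print(matched, overflow)
```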

I’m thinking of having some sort of custom formatting for the import so it will allow for automatic Unit creation too.

Something like this using hashes for Unit titles:

# Unit One,
## This is the first unit description,
林檎,
みかん,
友達,

# Unit Two,
## This is the second unit description,
犬,
猫,
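
A parser for a format like this could look roughly like the sketch below. It assumes the proposal exactly as written above: a line starting with one hash opens a unit, a line starting with two hashes is that unit's description, every other non-empty line is a vocab item, and trailing commas are ignored. The Unit class and parse_units function are hypothetical names, and the actual import tool may behave differently.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Unit:
    title: str
    description: str = ""
    vocab: List[str] = field(default_factory=list)


def parse_units(lines: Iterable[str]) -> List[Unit]:
    """Group vocab lines into units based on '#' and '##' header lines."""
    units: List[Unit] = []
    for raw in lines:
        line = raw.strip().rstrip(",").strip()
        if not line:
            continue
        if line.startswith("##"):  # check '##' first, since it also starts with '#'
            if units:  # a description before any unit title is ignored
                units[-1].description = line[2:].strip()
        elif line.startswith("#"):
            units.append(Unit(title=line[1:].strip()))
        else:
            if not units:
                # Vocab listed before any header goes into an untitled unit.
                units.append(Unit(title=""))
            units[-1].vocab.append(line)
    return units


example = """\
# Unit One,
## This is the first unit description,
林檎,
みかん,
友達,

# Unit Two,
## This is the second unit description,
犬,
猫,
"""
units = parse_units(example.splitlines())
for unit in units:
    print(unit.title, unit.vocab)
```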

Also out of curiosity, would you find an “Import via JMDict ID” feature useful too?
Using the JMDict ID would result in much more accurate importing.

Flutter · November 23, 2024, 8:18am

The current default behavior of my script is to output a long one-column CSV of Japanese dictionary forms anyway, so I think that’s perfectly fine. I could add IDs or replace the words with them too; I confirm all vocab against JMDict anyway because of things like tokenization mistakes and image recognition errors. If you add that feature, I’ll definitely make it an option. I’m sure it wouldn’t hurt.

JamesBunpro · November 23, 2024, 9:42am

Any way to specify a reading when importing? E.g., 明日(あす) etc. Will your parser go for the most common reading by default? I think that in itself is tricky, as frequency data normally goes by the way something is written, and automatically differentiating readings is a nightmare. Perhaps flag every word with multiple common readings when imported so the user can choose what they think is best? I am guessing that could also be a headache to implement, but I’m just trying to think of simple solutions for this issue. Even if you don’t have any way to solve this when you first release something, I suspect it may come up as an issue later down the line, especially if it becomes a popular tool, so I thought I’d flag it. Thanks for the hard work!

Flutter · November 23, 2024, 10:26am

I just added both features you suggested to my program, along with some recommended commands for Bunpro in the readme.

Also, @JamesBunpro I’m not sure this is feasible as any CSV will likely consist of automatically generated data anyway, which means the program likely couldn’t tell which exact reading it is in that case either, unless maybe you’d directly include data from furigana. I think if anything, alternate readings should just be displayed on each vocab page. I’m also not sure asking the user is particularly useful, at least if the user is importing a deck for themselves, because the whole point is that they probably don’t know the word yet so how would they know which one to choose, not to mention the word will have been taken out of context already anyway. I think the solution that it just points to the dictionary entry on Bunpro and you go from there is probably good enough.

nfive · November 23, 2024, 1:19pm

This sounds like an amazing addition!!

I’m still a beginner in Japanese, and apart from that I have no idea how to use the tools being mentioned here: scripts, parsers, CSV, even Anki are all unfamiliar to me. But I trust in the power of the community, and that people more versed in these will be creating great things!

Personally, I’d love it if at some point this means we can study from some popular decks, such as Core 2000 or Core 6000 (but I don’t know if that is feasible?), or have small thematic decks on interesting topics (I’d love a space/universe vocabulary deck, music deck, cinema deck… with any words useful for reading or talking about those topics).

Thanks all for continuing to make this an even better learning tool!

Magyarapointe · November 23, 2024, 5:09pm

@Flutter thank you for that amazing tool! I just have a couple of remarks/questions.

When I first tried to run it (on Mac, Python 3.9.12 through pyenv), I ran into an error for lines 57 and 67 of /site-packages/sample/main.py: in the

logging.info(f"Vocabulary: {", ".join(list(vocab)[:50])}, …")

line, my Python didn’t like that the inner and outer quotes were both double quotes. (The error message was
SyntaxError: f-string: expecting '}'

which Stack Overflow told me was about using the same kind of quotes twice in the f-string.) I manually changed the inner (I think it would work with the outer, too) quotes to single quotes, so that it now reads

logging.info(f"Vocabulary: {', '.join(list(vocab)[:50])}, …")

and the error went away. I don’t know if anyone else ran into that error, or if it’s my setup that’s faulty.
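
For context, reusing the outer quote type inside an f-string's braces is only accepted from Python 3.12 onward (PEP 701), so on Python 3.9 this is a genuine syntax error and not a setup problem. Here is a short sketch of two ways to write the line so it runs on older versions as well; the vocab list is example data, not output from the tool.

```python
import logging

# Example data only; an ordered list is used here so the output is predictable.
vocab = ["林檎", "みかん", "友達"]

# Option 1: use a different quote type inside the braces.
message = f"Vocabulary: {', '.join(list(vocab)[:50])}, ..."

# Option 2: build the joined string first, so no quotes appear inside the braces.
preview = ", ".join(list(vocab)[:50])
same_message = f"Vocabulary: {preview}, ..."

logging.info(message)
```

Both options produce the same string; the second one is a little easier to read when the expression inside the braces gets long.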

Is there a way to run your program on manga that I have already mokuro’ed? mokuro is great, but very time- and power-consuming, and I’d rather not re-run it if possible. I suppose your program runs on either the HTML or the .mokuro file? Could there be an option to feed it only one of those?

Thank you for the hard work!!

Flutter · November 23, 2024, 5:23pm

Thanks for the notice, I’ll change it right away! If you left all the mokuro files where they were generated, mokuro will realize it already processed them. I think this requires the _ocr folder to be present in the same directory where the .html and .mokuro files are generated (this is also necessary for my script, because that’s where it gets the lines from).

Magyarapointe · November 23, 2024, 6:14pm

Ah, you’re right. It won’t rerun the whole mokuro process, just check that everything is ok and then go on to produce the vocab file. Thanks!

haldo · November 23, 2024, 8:10pm

I like the format.
I think duplicate vocabulary in a CSV should be prioritized by first occurrence (if a word appears in unit 2 and unit 5, it should be learned in unit 2).
All this would improve the coverage of the CSV, and JMDict IDs seem to be a good option, but there is still a chance that we get false matches.
Maybe a results page that shows the details of how each item was resolved could be useful.

Flutter · November 23, 2024, 8:15pm

I think it’d be good if they implemented this as well if you do provide a csv with duplicates and sections, but even if they don’t, doing this yourself is really simple so at least that’s always a possibility
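
As a rough sketch of doing this yourself before uploading: the snippet below keeps each word only in the first unit where it appears. It assumes the units are already loaded into a dict mapping unit title to a word list; the function name keep_first_occurrence and the sample data are invented for this example.

```python
from typing import Dict, List


def keep_first_occurrence(units: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop every word that already appeared earlier, in this unit or a previous one."""
    seen = set()
    result: Dict[str, List[str]] = {}
    for title, words in units.items():  # dicts preserve insertion order
        kept = []
        for word in words:
            if word not in seen:
                seen.add(word)
                kept.append(word)
        result[title] = kept
    return result


# Invented sample: 犬 appears in both units and should stay only in the first.
sample_units = {
    "Unit 2": ["犬", "猫"],
    "Unit 5": ["鳥", "犬", "鳥"],
}
deduped = keep_first_occurrence(sample_units)
print(deduped)
```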

Jake · November 25, 2024, 7:34am

Just a small update:

We made some changes to how the URL of the deck works. Previously there was an issue with Japanese titles not converting to URLs. I have added the ability for you to customize how the URL of your deck looks. Please note that it only accepts alphanumeric input.

I have added a little refresh icon on hover to decks. This will refresh your progress if it is outdated, without needing to open the deck.

haldo · November 25, 2024, 11:25pm

I made a jpdb.io scraper that exports an anime’s vocabulary as a CSV in this format.
Here’s a sample file from cowboy bebop if you want to try:
https://pastebin.com/raw/VjpfRp8n

craze1x · November 25, 2024, 10:28pm

Hey, I was wondering where you got the script for Yokai Watch to make this?