Community Decks (Beta) - Nov 19th 2024

I looked into getting game data into custom decks; however, the hard part is less extracting the text than figuring out where the word boundaries fall inside Japanese sentences. If anyone has tools for extracting words from text, or if Bunpro could even use its massive vocabulary data to allow importing from plain text, I would appreciate it.

2 Likes

I found the best way is to use MeCab together with a dictionary. You can look at the tokenizer.py in the repository I posted in this thread a few posts ago. If you’re using Python, I think you could even just import my package and use the method directly to turn a list of strings into a set of vocab items.

2 Likes

I’m slowly working on this. I use the vetted list at jpdb.io as the source.
Currently 4 chapters done.

1 Like

this looks amazing!
I’m excited to get decks related to manga/anime/movies to help me study phrases and vocab that’s related specifically to the content :slight_smile:

in particular, I’d like to see ones for Solo Leveling and Perfect Days

2 Likes

Thanks! I went a little crazy and made a tab-separated CSV file for myself of the vocabulary from Yokai Watch 1, since I’ve heard that game is good for beginners (and it’s fun).

2 Likes

This is cool! Someone should make Japanese from Zero decks (someone might be me).

1 Like

Word of caution: I’m not sure how you will handle revisions of the books, but it’s probably something to be aware of.

1 Like

@Flutter

The import tool will only be for importing Vocab.
It will only allow one instance of each “query” string (a line of Japanese text in the CSV), so it will remove duplicate Vocab from the list before searching for them.

After parsing everything, it will search for the items in our Vocab DB.
Anything it can’t find will be added to a separate “overflow” (broken item) list for manual-searching/deletion.
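
To illustrate the flow described above, here is a minimal sketch, not the actual Bunpro implementation. The function name `resolve_queries` and the plain dict standing in for the Vocab DB are assumptions made up for this example.

```python
from typing import Dict, Iterable, List, Tuple


def resolve_queries(
    lines: Iterable[str], vocab_db: Dict[str, int]
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Deduplicate query strings, then split them into found and overflow lists."""
    seen = set()
    found: List[Tuple[str, int]] = []
    overflow: List[str] = []
    for raw in lines:
        # Drop surrounding whitespace and a trailing CSV comma.
        query = raw.strip().rstrip(",").strip()
        if not query or query in seen:
            continue  # skip blank lines and repeated query strings
        seen.add(query)
        if query in vocab_db:
            found.append((query, vocab_db[query]))
        else:
            overflow.append(query)  # left for manual searching/deletion
    return found, overflow


# Hypothetical stand-in for the Vocab DB: query string -> internal item ID.
demo_db = {"林檎": 1, "友達": 2, "犬": 3}
found, overflow = resolve_queries(["林檎,", "みかん,", "林檎,", "犬,"], demo_db)
print(found)
print(overflow)
```

In this sketch the second 林檎 is dropped before any lookup happens, and みかん ends up in the overflow list because the demo dict does not contain it.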

I’m thinking of having some sort of custom formatting for the import so it will allow for automatic Unit creation too.

Something like this using hashes for Unit titles:
# Unit One,
## This is the first unit description,
林檎,
みかん,
友達,

# Unit Two,
## This is the second unit description,
犬,
猫,

Also out of curiosity, would you find an “Import via JMDict ID” feature useful too?
Using the JMDict ID would result in much more accurate importing.

4 Likes

By default my script just outputs a long one-column CSV of Japanese dictionary forms anyway, so I think that’s perfectly fine. I could add IDs or replace the words with them, although I already confirm all vocab against JMDict because of things like tokenization mistakes and image recognition errors. If you add that feature, I’ll definitely make it an option; I’m sure it wouldn’t hurt.

1 Like

Is there any way to specify a reading when importing, e.g. 明日(あす)? Will your parser go for the most common reading by default? That in itself is tricky, since frequency data normally goes by how a word is written, and automatically differentiating readings is a nightmare. Perhaps flag every word with multiple common readings on import so the user can choose the one they think is best? I’m guessing that could also be a headache to implement; I’m just trying to think of simple solutions for this issue. Even if you don’t have a way to solve this when you first release something, I suspect it may come up later down the line, especially if it becomes a popular tool, so I thought I’d flag it. Thanks for the hard work!

3 Likes

I just added both features you suggested to my program, along with some recommended commands for Bunpro in the readme.

Also, @CursedKitsune I’m not sure this is feasible, as any CSV will likely consist of automatically generated data anyway, which means the program likely couldn’t tell which exact reading it is either, unless maybe you directly include data from furigana. If anything, I think alternate readings should just be displayed on each vocab page. I’m also not sure asking the user is particularly useful, at least if the user is importing a deck for themselves: the whole point is that they probably don’t know the word yet, so how would they know which one to choose? On top of that, the word will already have been taken out of context. I think a solution where it just points to the dictionary entry on Bunpro and you go from there is probably good enough.

3 Likes

This sounds like an amazing addition!!

I’m still a beginner in Japanese, and apart from that I have no idea how to use the tools being mentioned here: scripts, parsers, CSV, even Anki are all unfamiliar to me. But I trust in the power of the community, and that people more versed in these will create great things!

Personally I’d love it if at some point this meant we could study from some popular decks, such as Core 2000 or Core 6000 (though I don’t know if that is feasible), or have small thematic decks on interesting topics (I’d love a space/universe vocabulary deck, a music deck, a cinema deck… with any words useful for reading or talking about those topics).

Thanks all for continuing to make this an even better learning tool! :slight_smile:

4 Likes

@Flutter thank you for that amazing tool! I just have a couple of remarks/questions:

  • when I first tried to run it (on Mac, Python 3.9.12 through pyenv), I ran into an error on lines 57 and 67 of /site-packages/sample/main.py: in the

logging.info(f"Vocabulary: {", ".join(list(vocab)[:50])}, …")

line, my Python didn’t like that there were only double quotes. (The error message was
SyntaxError: f-string: expecting '}'

which Stack Overflow told me comes from using the same kind of quotes twice in the f-string.) I manually changed the inner quotes to single quotes (I think changing the outer ones would work too), so that it now reads

logging.info(f"Vocabulary: {', '.join(list(vocab)[:50])}, …")

and the error went away. I don’t know if anyone else ran into that error, or if it’s my setup that’s faulty.

  • is there a way to run your program on manga that I have already mokuro’ed? mokuro is great, but very time- and power-consuming, and I’d rather not re-run it if possible. I suppose your program runs on either the html or the .mokuro file? Could there be an option to feed it only one of those?
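
On the f-string error from the first point: as far as I know, reusing the outer quote character inside an f-string expression is only accepted from Python 3.12 on, so older interpreters such as 3.9 reject it. Here is a minimal sketch of two spellings that should work on older versions too; the `vocab` list is made-up sample data, not output from the tool.

```python
import logging

logging.basicConfig(level=logging.INFO)

vocab = ["犬", "猫", "友達"]  # hypothetical sample data

# Option 1: inner quotes differ from the outer ones.
message_a = f"Vocabulary: {', '.join(vocab[:50])}, …"

# Option 2: join first, so the f-string contains no nested quotes at all.
preview = ", ".join(vocab[:50])
message_b = f"Vocabulary: {preview}, …"

logging.info(message_a)
logging.info(message_b)
```

Both variants build the same string; the second is a bit easier to read when the expression gets longer.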

Thank you for the hard work!!

2 Likes

Thanks for the notice, I’ll change it right away! If you left all the mokuro files where they were generated, mokuro will realize it already processed them. I think this requires the _ocr folder to be present in the same directory where the .html and .mokuro files are generated (this is also necessary for my script, because that’s where it gets the lines from).

1 Like

Ah, you’re right. It won’t rerun the whole mokuro process; it just checks that everything is OK and then goes on to produce the vocab file. Thanks!

2 Likes

I like the format.
I think duplicate vocabulary in a CSV should be prioritized by first occurrence (if a word appears in unit 2 and unit 5, it should be learned in unit 2).
The JMDict ID seems to be a good option and should increase how much of the CSV gets matched, but there is still a chance of false matches.
Maybe a results page showing how each item was resolved would be useful.
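
The first-occurrence rule could be sketched like this. It is only an illustration: the `keep_first_occurrence` name and the dict-of-lists input shape are assumptions, not part of any existing tool. The list of dropped duplicates is the kind of detail a results page could show.

```python
from typing import Dict, List, Tuple


def keep_first_occurrence(
    units: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], List[Tuple[str, str, str]]]:
    """Keep each word only in the first unit it appears in.

    Returns the cleaned units plus (word, dropped_from, kept_in) records.
    """
    first_seen: Dict[str, str] = {}
    cleaned: Dict[str, List[str]] = {}
    dropped: List[Tuple[str, str, str]] = []
    for unit, words in units.items():  # dicts preserve insertion order
        cleaned[unit] = []
        for word in words:
            if word in first_seen:
                dropped.append((word, unit, first_seen[word]))
            else:
                first_seen[word] = unit
                cleaned[unit].append(word)
    return cleaned, dropped


demo_units = {"Unit 2": ["犬", "猫"], "Unit 5": ["猫", "鳥"]}
cleaned, dropped = keep_first_occurrence(demo_units)
print(cleaned)
print(dropped)
```

Here 猫 stays in Unit 2 and is removed from Unit 5, with one record noting where it was dropped and where it was kept.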

1 Like

I think it’d be good if they implemented this as well for CSVs that have duplicates and sections. But even if they don’t, doing this yourself is really simple, so at least that’s always a possibility.

1 Like

Just a small update:

  • We made some changes to how the URL of a deck works. Previously there was an issue where Japanese titles were not converted to URLs. I have added the ability for you to customize how the URL of your deck looks. Please note that it only accepts alphanumeric input.

  • I have added a little refresh icon that appears when you hover over a deck. This will refresh your progress if it is outdated, without needing to open the deck.


9 Likes

I made a scraper for jpdb.io that exports an anime’s vocabulary as a CSV in this format.
Here’s a sample file from Cowboy Bebop if you want to try it:
https://pastebin.com/raw/VjpfRp8n

2 Likes

Hey, I was wondering where you got the script for Yokai Watch to make this? :slight_smile:

2 Likes