Where do the Vocab Frequency Lists Come From?

Apologies if this is addressed somewhere (tried searching, couldn’t find it), but I mean these lists specifically:

image

Like, where does the Bunpro team get them from? I’m also curious what is meant by “General” frequency. We already have Default frequency (what Bunpro recommends for ease of learning) and Dictionary (I guess just pulled from an online dictionary), but where does that leave General?

8 Likes

My concern is that despite setting “Anime” as the priority, I’ve only seen Anime frequencies in the 500-1000 range when I was expecting it to start at 1 and descend from there.

Bumping this thread tho, also curious where the data is pulled from.

1 Like

The word frequency data we use is aggregate data from a list that you can find online. Much like the JLPT data, there isn’t really any “true” frequency list because whatever subset of anime/novels etc is used to make a frequency list will end up biasing the data.

@TangoTangoSIerra The data does start from 1 but in most cases the first couple hundred are going to be particles and super basic common words like いい, する, ある

6 Likes

@Jake
Thank you for your reply!
I understand, but would it really go as low as 1,000+? Or do I have a setting wrong somewhere?

I’m not really sure what you mean by go as low as 1000+. Each word takes up one spot, so unless there are fewer than 1000 words in a frequency list, it would go beyond that pretty easily.

Sorry if I was unclear. What I mean is this: If I’m sorting by “Anime - Most Frequent” and the “first couple hundred are going to be particles and super basic common words like いい, する, ある” as you said, I am still seeing the frequency of the anime word as “Top 1,000” with the actual number being 1,139 (for example). Shouldn’t the frequency number be a lot higher? (lower number, technically)

My guess is you already know the other 1138 words in the list. For example, I know few words, and the words I get when studying the top 2k sorted by general frequency are much closer to the beginning of the list (see screenshot).

1 Like

Ah I see where the confusion might be. The frequency is global rather than based on the deck.

So a deck that only has rare words from anime might only ever show 1k+ or even 10k+ because there are tens of thousands of words in each frequency list.
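To make the global-versus-deck distinction concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the rank numbers are made up, and `global_ranks` and `study_queue` are hypothetical names, not Bunpro's actual data or code. It only shows why the first word you are served can carry a rank like 1,139 even when sorting by most frequent.

```python
# Hypothetical global frequency list: rank 1 is the most frequent word.
# These numbers are invented for illustration, not real frequency data.
global_ranks = {
    "の": 1,
    "する": 2,
    "ある": 3,
    "いい": 4,
    "魔法": 1139,
    "召喚": 5210,
}


def study_queue(deck, known, ranks):
    """Return (word, global_rank) pairs for unknown deck words, most frequent first."""
    pending = [word for word in deck if word not in known and word in ranks]
    # Sorting only orders the words; each one keeps its rank from the global list.
    return sorted(((word, ranks[word]) for word in pending), key=lambda pair: pair[1])


deck = ["魔法", "召喚", "する", "いい"]
known = {"する", "いい"}  # words already learned are skipped

queue = study_queue(deck, known, global_ranks)
print(queue)  # [('魔法', 1139), ('召喚', 5210)]
```

The deck is sorted by frequency, but the number shown is each word's position in the whole list, not its position within the deck, so it never resets to 1.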

2 Likes