
Corpus & Vocabulary – Community Dev

ek3nk4r, 4 weeks ago


A corpus is a collection of text. It can be anything from a few paragraphs to an entire book.

So why are we talking about it?

In natural language processing, the text to analyze may be several paragraphs or a very long document, so we want to learn how to deal with large amounts of text.

To analyze text, keep these corpus pre-processing steps in mind:

Tokenization
Stop word removal
Special character removal
Lowercase conversion
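The four steps above can be sketched with the standard library alone, which helps to see what each one does before bringing in NLTK. This is only a rough illustration under stated assumptions: `STOP_WORDS` is a tiny hypothetical list, the `preprocess` helper is a made-up name, and a whitespace split stands in for a real tokenizer. The steps are also applied in a slightly different order than listed, because lowercasing first makes the other steps simpler.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only;
# a real list (such as NLTK's English stop words) is much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "it", "and", "or"}

def preprocess(text):
    """Return lowercase tokens with stop words and special characters removed."""
    # Lowercase conversion
    text = text.lower()
    # Special character removal: keep letters, digits, hyphens and whitespace
    text = re.sub(r"[^a-z0-9\-\s]", " ", text)
    # Tokenization: a plain whitespace split stands in for a real tokenizer
    tokens = text.split()
    # Stop word removal, also dropping single-character tokens
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= 2]

sample = "An example of annotating a corpus is part-of-speech tagging, or POS-tagging."
print(preprocess(sample))
# ['example', 'annotating', 'corpus', 'part-of-speech', 'tagging', 'pos-tagging']
```

A proper tokenizer handles cases this sketch ignores, such as contractions and possessives.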

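1. Tokenization:

# word_tokenize needs NLTK's tokenizer data, e.g. nltk.download('punkt')
# (or 'punkt_tab' on newer NLTK releases)
from nltk.tokenize import word_tokenize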

corpus="In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual."

print(word_tokenize(corpus))

Output:
['In', 'order', 'to', 'make', 'the', 'corpora', 'more', 'useful', 'for', 'doing', 'linguistic', 'research', ',', 'they', 'are', 'often', 'subjected', 'to', 'a', 'process', 'known', 'as', 'annotation', '.', 'An', 'example', 'of', 'annotating', 'a', 'corpus', 'is', 'part-of-speech', 'tagging', ',', 'or', 'POS-tagging', ',', 'in', 'which', 'information', 'about', 'each', 'word', "'s", 'part', 'of', 'speech', '(', 'verb', ',', 'noun', ',', 'adjective', ',', 'etc', '.', ')', 'is', 'added', 'to', 'the', 'corpus', 'in', 'the', 'form', 'of', 'tags', '.', 'Another', 'example', 'is', 'indicating', 'the', 'lemma', '(', 'base', ')', 'form', 'of', 'each', 'word', '.', 'When', 'the', 'language', 'of', 'the', 'corpus', 'is', 'not', 'a', 'working', 'language', 'of', 'the', 'researchers', 'who', 'use', 'it', ',', 'interlinear', 'glossing', 'is', 'used', 'to', 'make', 'the', 'annotation', 'bilingual', '.']

2. Remove stop words, comparing in lowercase:

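from nltk.corpus import stopwords

# Build the stop word set once instead of reloading the list on every iteration
stop_words = set(stopwords.words('english'))

for word in word_tokenize(corpus):
    # Compare in lowercase; skip stop words and single-character tokens
    if word.lower() not in stop_words and len(word) >= 2:
        print(word)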

Output:
order
make
corpora
useful
linguistic
research
often
subjected
process
known
annotation
example
annotating
corpus
part-of-speech
tagging
POS-tagging
information
word
's
part
speech
verb
noun
adjective
etc
added
corpus
form
tags
Another
example
indicating
lemma
base
form
word
language
corpus
working
language
researchers
use
interlinear
glossing
used
make
annotation
bilingual

3. Collect the lowercased words in a list:

# Collect the lowercased tokens that survive the filter
stop_words = set(stopwords.words('english'))
words = []
for word in word_tokenize(corpus):
    if word.lower() not in stop_words and len(word) >= 2:
        words.append(word.lower())

print(words)

Output:
['order', 'make', 'corpora', 'useful', 'linguistic', 'research', 'often', 'subjected', 'process', 'known', 'annotation', 'example', 'annotating', 'corpus', 'part-of-speech', 'tagging', 'pos-tagging', 'information', 'word', "'s", 'part', 'speech', 'verb', 'noun', 'adjective', 'etc', 'added', 'corpus', 'form', 'tags', 'another', 'example', 'indicating', 'lemma', 'base', 'form', 'word', 'language', 'corpus', 'working', 'language', 'researchers', 'use', 'interlinear', 'glossing', 'used', 'make', 'annotation', 'bilingual']

4. Unique words:

print(set(words))

Output:
{'order', 'make', 'corpora', 'useful', 'linguistic', 'research', 'often', 'subjected', 'process', 'known', 'annotation', 'example', 'annotating', 'corpus', 'part-of-speech', 'tagging', 'pos-tagging', 'information', 'word', "'s", 'part', 'speech', 'verb', 'noun', 'adjective', 'etc', 'added', 'form', 'tags', 'another', 'indicating', 'lemma', 'base', 'language', 'working', 'researchers', 'use', 'interlinear', 'glossing', 'used', 'bilingual'}

(A set is unordered, so the elements may print in a different order on each run.)
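Because print(set(words)) shows the words in an arbitrary order, it can help to get the unique words in a repeatable order instead. The following is a small sketch using only built-ins; the short `words` list here is a made-up sample, not the corpus from this post.

```python
# Made-up sample list with repeated words
words = ["word", "corpus", "form", "corpus", "tags", "word"]

# sorted() gives an alphabetical, repeatable ordering of the unique words
alphabetical = sorted(set(words))
print(alphabetical)   # ['corpus', 'form', 'tags', 'word']

# dict.fromkeys() keeps each word once, in order of first appearance
first_seen = list(dict.fromkeys(words))
print(first_seen)     # ['word', 'corpus', 'form', 'tags']
```

Which one to prefer depends on whether the original reading order of the text matters for the analysis.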

In the context of NLP, the vocabulary is the set of unique words in a corpus, counted after pre-processing.

To see how many words remain after stop word removal, and how many of those are unique:

print(len(words))
print(len(set(words)))

Output:
49
41
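To see where the difference between 49 and 41 comes from, the repeated words can be counted with collections.Counter from the standard library. This sketch hard-codes the list printed in step 3 so that it runs without NLTK; the variable names `counts` and `repeated` are only illustrative.

```python
from collections import Counter

# The pre-processed word list from step 3, copied in so the sketch is self-contained
words = ['order', 'make', 'corpora', 'useful', 'linguistic', 'research', 'often',
         'subjected', 'process', 'known', 'annotation', 'example', 'annotating',
         'corpus', 'part-of-speech', 'tagging', 'pos-tagging', 'information', 'word',
         "'s", 'part', 'speech', 'verb', 'noun', 'adjective', 'etc', 'added', 'corpus',
         'form', 'tags', 'another', 'example', 'indicating', 'lemma', 'base', 'form',
         'word', 'language', 'corpus', 'working', 'language', 'researchers', 'use',
         'interlinear', 'glossing', 'used', 'make', 'annotation', 'bilingual']

# Count how often each word occurs
counts = Counter(words)

# Words that occur more than once explain the gap between the two lengths
repeated = {w: c for w, c in counts.items() if c > 1}
print(repeated)
# {'make': 2, 'annotation': 2, 'example': 2, 'corpus': 3, 'word': 2, 'form': 2, 'language': 2}

# Number of extra occurrences: 49 total words minus 41 unique words
print(len(words) - len(counts))   # 8
```

Seven words repeat, and 'corpus' appears three times, which accounts for the 8 extra occurrences.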
