You are a excellent NLP engineer. You have to generate a classification NLP model training and saving code considering the following descriptions. The model learned through your code will be used for competition submission, so you have to write it to show very sophisticated and good performance. You have to train 3 models for ensemble. Train the models separately but with same Dataset and DataLoader. Inference result on test dataset of each model will be summed according to Logit Ensemble. 1. Task: Classification of english news paper 2. Pre-trained Model: 2-1. abhishek/autonlp-bbc-news-classification-37229289 2-2. distilbert-base-uncased 2-3. valurank/finetuned-distilbert-news-article-categorization 3. Dataset: train.csv 3-1. columns: id, text, label id: sample id, text: news paper context, label: category(0-7) 4. Label mapping: 0: Sci/Tech, 1: Sports, 2: Business, 3: World, 4: Politics, 5: ESG, 6: Health, 7: Entertainment 5. Training Parameters: learning rate, batch size, number of epochs 6. Evaluation Metrics: Macro F1 Score 7. Hardware requirements: GPU - A6000 x 4 Here is the instructions. 0. Use huggingface AutoModel, AutoTokenizer, Trainer Library. 1. You have to generate only code not any explanation. The code you have to write is separated into following parts. 1. Load the train dataset from train path. Loaded dataset must be some multiple lists removing not necessary columns. 2. There is only train.csv dataset. Thus, create the train and validation dataset using train.csv. 3. Dataset and DataLoader. 4. Define the compute metrics function considering validation loss. 4. Define the model and training arguments. 5. Train the model and save the best model weights considering the macro f1 using Trainer Library. First, Write the code until step 3.
Sure, here's the code for the first three steps:
pythonCopy codeimport pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
from torch.utils.data import Dataset, DataLoader
# Step 1: Load the train dataset
train_df = pd.read_csv('train.csv')
train_texts = train_df['text'].tolist()
train_labels = train_df['label'].tolist()
# Step 2: Create the train and validation dataset
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=42)
# Step 3: Dataset and DataLoader
class NewsDataset(Dataset):
def __init__(self, texts, labels, tokenizer, max_len):
self.texts = texts
self.labels = labels
self.tokenizer = tokenizer
self.max_len = max_len
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
text = self.texts[idx]
label = self.labels[idx]
encoding = self.tokenizer(text, padding='max_length', truncation=True, max_length=self.max_len, return_tensors='pt')
input_ids = encoding['input_ids'][0]
attention_mask = encoding['attention_mask'][0]
return {
'input_ids': input_ids,
'attention_mask': attention_mask,
'label': label
}
tokenizer = AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289')
train_dataset = NewsDataset(train_texts, train_labels, tokenizer, max_len=512)
val_dataset = NewsDataset(val_texts, val_labels, tokenizer, max_len=512)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8)
Note that we're using the AutoTokenizer
from Hugging Face to load the pre-trained tokenizers for the three models specified in the prompt. We're also using the sklearn
library to split the train dataset into train and validation sets. We then define a NewsDataset
class to preprocess and encode the text data, and create DataLoader
objects for both the train and validation datasets.
Next, define the compute_metrics considering only macro_f1.
Sure, here's the code for the compute_metrics
function that considers only the macro F1 score:
pythonCopy codefrom sklearn.metrics import f1_score, precision_recall_fscore_support
def compute_metrics(pred):
labels = pred.label_ids
preds = pred.predictions.argmax(-1)
# Macro F1 Score
macro_f1 = f1_score(labels, preds, average='macro')
return {'macro_f1': macro_f1}
Here, we're using the f1_score
function from sklearn.metrics
to compute the macro F1 score, and returning the result in a dictionary with the key macro_f1
. Note that we're also using the argmax
function to convert the model predictions to the label indices with the highest probability, as required by the f1_score
function.
Next, write the code for step 4: Define the models and training arguments. You have to consider the metrics return key. learning_rate: 1e-5 num_epochs: 5 batch_size: 16 fp16: True dataloader_num_workers=4 gradient_accumulation_steps=1
Sure, here's the code for step 4:
pythonCopy codefrom transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
# Step 4: Define the model and training arguments
model_names = [
'abhishek/autonlp-bbc-news-classification-37229289',
'distilbert-base-uncased',
'valurank/finetuned-distilbert-news-article-categorization'
]
models = [AutoModelForSequenceClassification.from_pretrained(name, num_labels=8) for name in model_names]
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy='epoch',
learning_rate=1e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=5,
weight_decay=0.01,
push_to_hub=False,
logging_dir='./logs',
logging_steps=10,
load_best_model_at_end=True,
metric_for_best_model='macro_f1',
greater_is_better=True,
fp16=True,
dataloader_num_workers=4,
gradient_accumulation_steps=1
)
Here, we're defining the three models using the AutoModelForSequenceClassification
class from Hugging Face, and setting the num_labels
argument to 8 to match the number of categories in the dataset.
We're also defining the TrainingArguments
for the model training, which specifies various training parameters such as the learning rate, batch size, number of epochs, and whether to use mixed-precision training with FP16. Note that we're setting metric_for_best_model
to 'macro_f1'
to optimize for the macro F1 score, which is what we're using as the evaluation metric. Finally, we're setting load_best_model_at_end
to True
to load the best model weights based on the evaluation metric at the end of training.
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
You're right, thanks for the suggestion! The ignore_mismatched_sizes
argument can be added to the from_pretrained
method to handle any mismatch in the number of labels between the pre-trained models and the target dataset. Here's the updated code for defining the models:
pythonCopy codefrom transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
# Step 4: Define the model and training arguments
model_names = [
'abhishek/autonlp-bbc-news-classification-37229289',
'distilbert-base-uncased',
'valurank/finetuned-distilbert-news-article-categorization'
]
models = [AutoModelForSequenceClassification.from_pretrained(name, num_labels=8, ignore_mismatched_sizes=True) for name in model_names]
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy='epoch',
learning_rate=1e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=5,
weight_decay=0.01,
push_to_hub=False,
logging_dir='./logs',
logging_steps=10,
load_best_model_at_end=True,
metric_for_best_model='macro_f1',
greater_is_better=True,
fp16=True,
dataloader_num_workers=4,
gradient_accumulation_steps=1
)
Now the ignore_mismatched_sizes=True
argument is added to the from_pretrained
method call to handle any mismatch in the number of labels.
ValueError: --load_best_model_at_end requires the save and eval strategy to match, but found - Evaluation strategy: epoch - Save strategy: steps
You're right, thank you for pointing that out! The load_best_model_at_end
argument requires the save and evaluation strategies to match, so we need to update the save_strategy
argument in TrainingArguments
to epoch
to match the evaluation_strategy
. Here's the updated code:
pythonCopy codefrom transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
# Step 4: Define the model and training arguments
model_names = [
'abhishek/autonlp-bbc-news-classification-37229289',
'distilbert-base-uncased',
'valurank/finetuned-distilbert-news-article-categorization'
]
models = [AutoModelForSequenceClassification.from_pretrained(name, num_labels=8, ignore_mismatched_sizes=True) for name in model_names]
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy='epoch',
learning_rate=1e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=5,
weight_decay=0.01,
push_to_hub=False,
logging_dir='./logs',
logging_steps=10,
save_strategy='epoch',
load_best_model_at_end=True,
metric_for_best_model='macro_f1',
greater_is_better=True,
fp16=True,
dataloader_num_workers=4,
gradient_accumulation_steps=1
)
Now the save_strategy
argument is set to 'epoch'
to match the evaluation_strategy
, so that load_best_model_at_end
can be used without error.
Next, write the code for step 5: Train the model and save the best model weights considering the macro f1 score using Trainer Library. Do not use data collator Train and save the model separately. save folder: best_model/ Also, include the code for tokenizer parallelism is false using os library.
Sure, here's the code for step 5:
pythonCopy codefrom transformers import Trainer
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Step 5: Train the model and save the best model weights
def train_and_save_model(model, training_args, train_dataset, val_dataset, output_dir):
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
compute_metrics=compute_metrics
)
trainer.train()
trainer.save_model(output_dir)
for i, model in enumerate(models):
output_dir = f'best_model/{i+1}/'
os.makedirs(output_dir, exist_ok=True)
train_and_save_model(model, training_args, train_dataset, val_dataset, output_dir)
Here, we're defining a train_and_save_model
function that takes a single model, its corresponding training arguments, the train and validation datasets, and an output directory for saving the best model weights. We're then using the Trainer
class from Hugging Face to train the model and save the best model weights based on the macro F1 score.
We're also using the os
library to set the TOKENIZERS_PARALLELISM
environment variable to 'false'
, which disables the parallelism in tokenization to avoid running out of memory.
Note that we're training and saving each model separately in a subdirectory of best_model/
corresponding to its index in the models
list. The best model weights for each model will be saved in their respective subdirectory. We're not using a data collator here as we're using the default collator, which is suitable for sequence classification tasks like this one.
we have to train and test for the another model: roberta-base Please write the code for roberta-base, including tokenizer, dataset, dataloader, train and save model. You have to only write for roberta-base, not other models. Reuse the same dataset and dataloader, training arguments above but another tokenizer. Save the model on /best_model/4 .
Sure, here's the updated code for training and saving the roberta-base
model:
pythonCopy codefrom transformers import AutoModelForSequenceClassification
# Use a different tokenizer for RoBERTa
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
# Define a new dataset using the same train and validation data
train_dataset = NewsDataset(train_texts, train_labels, tokenizer, max_len=512)
val_dataset = NewsDataset(val_texts, val_labels, tokenizer, max_len=512)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8)
# Define the RoBERTa model and train and save it
roberta_model = AutoModelForSequenceClassification.from_pretrained('roberta-base', num_labels=8, ignore_mismatched_sizes=True)
roberta_output_dir = 'best_model/4/'
os.makedirs(roberta_output_dir, exist_ok=True)
train_and_save_model(roberta_model, training_args, train_dataset, val_dataset, roberta_output_dir)
Here, we're using the AutoTokenizer
to load the pre-trained tokenizer for roberta-base
, and defining a new NewsDataset
using the same train and validation data but with the new tokenizer. We're also creating new DataLoader
objects using the updated datasets.
We then define a new AutoModelForSequenceClassification
for roberta-base
, and train and save it using the same train_and_save_model
function as before. Note that we're saving the best model weights in a new subdirectory called best_model/4/
.
I want to submit the submision.csv to DACON contest. Please write the code that load the trained weights and submit the submision.csv for test.csv You have to logit ensemble on test.csv Instructions: 1. text.csv consist of two columns: id, text 2. after the trained weights are loaded, we must predict on the test.csv and submit the submssion.csv file. 3. Submission.csv file must be consisted of two columns: id, label 4. Use gpu to generate predictions 5. Use two tokenizers: 5-1. best_model/1, 2, 3: 'distilbert-base-uncased' 5-2. best_model/4: thanks 'roberta-base' First, define the new TestDataset and DataLoader considering CUDA out of memory. In the first step, you have to use different tokenizers per model considering the above information. Next, inference on test.csv with weighted ensemble on each model. model1: 0.3, model2: 0.15, model3: 0.15, model4: 0.4
Sure, here's the code for loading the trained weights and generating predictions on the test set using a weighted ensemble:
pythonCopy codeimport pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Define the new TestDataset with the same structure as NewsDataset
class TestDataset(Dataset):
def __init__(self, texts, tokenizer, max_len):
self.texts = texts
self.tokenizer = tokenizer
self.max_len = max_len
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
text = str(self.texts[idx])
inputs = self.tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=self.max_len,
padding='max_length',
truncation=True,
return_token_type_ids=True,
)
return {
'input_ids': torch.tensor(inputs['input_ids'], dtype=torch.long),
'attention_mask': torch.tensor(inputs['attention_mask'], dtype=torch.long),
'token_type_ids': torch.tensor(inputs['token_type_ids'], dtype=torch.long),
}
# Define the ensembling weights for each model
model_weights = [0.3, 0.15, 0.15, 0.4]
# Load the test data from the CSV file
test_data = pd.read_csv('test.csv')
# Load the best model weights for each model
model_names = [
'best_model/1/',
'best_model/2/',
'best_model/3/',
'best_model/4/',
]
tokenizers = [
AutoTokenizer.from_pretrained('distilbert-base-uncased'),
AutoTokenizer.from_pretrained('distilbert-base-uncased'),
AutoTokenizer.from_pretrained('distilbert-base-uncased'),
AutoTokenizer.from_pretrained('roberta-base'),
]
models = [
AutoModelForSequenceClassification.from_pretrained(name, num_labels=8, ignore_mismatched_sizes=True).to('cuda')
for name in model_names
]
# Generate the predictions on the test set using a weighted ensemble
predictions = torch.zeros(len(test_data), 8, device='cuda')
for i, model in enumerate(models):
tokenizer = tokenizers[i]
test_dataset = TestDataset(test_data['text'], tokenizer, max_len=512)
test_loader = DataLoader(test_dataset, batch_size=8, num_workers=4)
model.eval()
with torch.no_grad():
for j, batch in enumerate(test_loader):
input_ids = batch['input_ids'].to('cuda')
attention_mask = batch['attention_mask'].to('cuda')
token_type_ids = batch['token_type_ids'].to('cuda')
outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
logits = outputs.logits
predictions[j*8:(j+1)*8, :] += model_weights[i] * logits
# Convert the predictions to labels and create a submission CSV file
labels = torch.argmax(predictions, dim=1).cpu().numpy()
submission = pd.DataFrame({'id': test_data['id'], 'label': labels})
submission.to_csv('submission.csv', index=False)
Here, we're defining a new TestDataset
to load the test data from the CSV file and tokenize it using the appropriate tokenizer for each model. We're then loading the best model weights for each model and defining the ensembling weights for each model.
We're then generating the predictions on the test set using a weighted ensemble, where each model's predictions are multiplied by its corresponding weight and
Include the tokenizer parallelism false
Sure, to disable parallelism in tokenization and avoid running out of memory, we can set the TOKENIZERS_PARALLELISM
environment variable to 'false'
using the os
library:
pythonCopy codeimport os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
We can add this code at the beginning of the script, before loading the tokenizers. The updated code looks like this:
pythonCopy codeimport pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Disable parallelism in tokenization to avoid running out of memory
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Define the new TestDataset with the same structure as NewsDataset
class TestDataset(Dataset):
def __init__(self, texts, tokenizer, max_len):
self.texts = texts
self.tokenizer = tokenizer
self.max_len = max_len
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
text = str(self.texts[idx])
inputs = self.tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=self.max_len,
padding='max_length',
truncation=True,
return_token_type_ids=True,
)
return {
'input_ids': torch.tensor(inputs['input_ids'], dtype=torch.long),
'attention_mask': torch.tensor(inputs['attention_mask'], dtype=torch.long),
'token_type_ids': torch.tensor(inputs['token_type_ids'], dtype=torch.long),
}
# Define the ensembling weights for each model
model_weights = [0.3, 0.15, 0.15, 0.4]
# Load the test data from the CSV file
test_data = pd.read_csv('test.csv')
# Load the best model weights for each model
model_names = [
'best_model/1/',
'best_model/2/',
'best_model/3/',
'best_model/4/',
]
tokenizers = [
AutoTokenizer.from_pretrained('distilbert-base-uncased'),
AutoTokenizer.from_pretrained('distilbert-base-uncased'),
AutoTokenizer.from_pretrained('distilbert-base-uncased'),
AutoTokenizer.from_pretrained('roberta-base'),
]
models = [
AutoModelForSequenceClassification.from_pretrained(name, num_labels=8, ignore_mismatched_sizes=True).to('cuda')
for name in model_names
]
# Generate the predictions on the test set using a weighted ensemble
predictions = torch.zeros(len(test_data), 8, device='cuda')
for i, model in enumerate(models):
tokenizer = tokenizers[i]
test_dataset = TestDataset(test_data['text'], tokenizer, max_len=512)
test_loader = DataLoader(test_dataset, batch_size=8, num_workers=4)
model.eval()
with torch.no_grad():
for j, batch in enumerate(test_loader):
input_ids = batch['input_ids'].to('cuda')
attention_mask = batch['attention_mask'].to('cuda')
token_type_ids = batch['token_type_ids'].to('cuda')
outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
logits = outputs.logits
predictions[j*8:(j+1)*8, :] += model_weights[i] * logits
# Convert the predictions to labels and create a submission CSV file
labels = torch.argmax(predictions, dim=1).cpu().numpy()
submission = pd.DataFrame({'id': test_data['id
I am sorry, please fix the tokenizers for each model. best_model1: abhishek/autonlp-bbc-news-classification-37229289 best_model2: abhishek/autonlp-bbc-news-classification-37229289 best_model3: abhishek/autonlp-bbc-news-classification-37229289 best_model4: roberta-base
Sure, here's the updated code with the correct tokenizers for each model:
pythonCopy codeimport pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Disable parallelism in tokenization to avoid running out of memory
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Define the new TestDataset with the same structure as NewsDataset
class TestDataset(Dataset):
def __init__(self, texts, tokenizer, max_len):
self.texts = texts
self.tokenizer = tokenizer
self.max_len = max_len
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
text = str(self.texts[idx])
inputs = self.tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=self.max_len,
padding='max_length',
truncation=True,
return_token_type_ids=True,
)
return {
'input_ids': torch.tensor(inputs['input_ids'], dtype=torch.long),
'attention_mask': torch.tensor(inputs['attention_mask'], dtype=torch.long),
'token_type_ids': torch.tensor(inputs['token_type_ids'], dtype=torch.long),
}
# Define the ensembling weights for each model
model_weights = [0.3, 0.15, 0.15, 0.4]
# Load the test data from the CSV file
test_data = pd.read_csv('test.csv')
# Load the best model weights for each model
model_names = [
'best_model/1/',
'best_model/2/',
'best_model/3/',
'best_model/4/',
]
tokenizers = [
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('roberta-base'),
]
models = [
AutoModelForSequenceClassification.from_pretrained(name, num_labels=8, ignore_mismatched_sizes=True).to('cuda')
for name in model_names
]
# Generate the predictions on the test set using a weighted ensemble
predictions = torch.zeros(len(test_data), 8, device='cuda')
for i, model in enumerate(models):
tokenizer = tokenizers[i]
test_dataset = TestDataset(test_data['text'], tokenizer, max_len=512)
test_loader = DataLoader(test_dataset, batch_size=8, num_workers=4)
model.eval()
with torch.no_grad():
for j, batch in enumerate(test_loader):
input_ids = batch['input_ids'].to('cuda')
attention_mask = batch['attention_mask'].to('cuda')
token_type_ids = batch['token_type_ids'].to('cuda')
outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
logits = outputs.logits
predictions[j*8:(j+1)*8, :] += model_weights[i] * logits
# Convert the predictions to labels and create a submission CSV file
labels = torch.argmax(predictions, dim=1).cpu().numpy()
submission = pd.DataFrame({'id': test_data['id'], 'label': labels})
submission.to_csv('submission.csv', index=False)
Here, we're using the correct tokenizers for each model, as specified in the instructions. We're also setting the `TOKENIZERS
TypeError: forward() got an unexpected keyword argument 'token_type_ids'
I apologize for the error in my previous response. It looks like the RobertaForSequenceClassification
model doesn't have a token_type_ids
argument. Instead, it uses the attention_mask
argument to distinguish between the different sequences in the input.
Here's the corrected code for the TestDataset
class and the inference on the RobertaForSequenceClassification
model:
pythonCopy codeimport pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Disable parallelism in tokenization to avoid running out of memory
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Define the new TestDataset with the same structure as NewsDataset
class TestDataset(Dataset):
def __init__(self, texts, tokenizer, max_len):
self.texts = texts
self.tokenizer = tokenizer
self.max_len = max_len
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
text = str(self.texts[idx])
inputs = self.tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=self.max_len,
padding='max_length',
truncation=True,
return_attention_mask=True,
)
return {
'input_ids': torch.tensor(inputs['input_ids'], dtype=torch.long),
'attention_mask': torch.tensor(inputs['attention_mask'], dtype=torch.long),
}
# Define the ensembling weights for each model
model_weights = [0.3, 0.15, 0.15, 0.4]
# Load the test data from the CSV file
test_data = pd.read_csv('test.csv')
# Load the best model weights for each model
model_names = [
'best_model/1/',
'best_model/2/',
'best_model/3/',
'best_model/4/',
]
tokenizers = [
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('roberta-base'),
]
models = [
AutoModelForSequenceClassification.from_pretrained(name, num_labels=8, ignore_mismatched_sizes=True).to('cuda')
for name in model_names
]
# Generate the predictions on the test set using a weighted ensemble
predictions = torch.zeros(len(test_data), 8, device='cuda')
for i, model in enumerate(models):
tokenizer = tokenizers[i]
test_dataset = TestDataset(test_data['text'], tokenizer, max_len=512)
test_loader = DataLoader(test_dataset, batch_size=8, num_workers=4)
model.eval()
with torch.no_grad():
for j, batch in enumerate(test_loader):
input_ids = batch['input_ids'].to('cuda')
attention_mask = batch['attention_mask'].to('cuda')
outputs = model(input_ids, attention_mask=attention_mask)
logits = outputs.logits
predictions[j*8:(j+1)*8, :] += model_weights[i] * logits
# Convert the predictions to labels and create a submission CSV file
labels = torch.argmax(predictions, dim=1).cpu().numpy()
submission = pd.DataFrame({'id': test_data['id'], 'label': labels})
submission.to_csv('submission.csv', index=False)
Here, we're using the correct
rewrite the code for following weights: model1: 0.2, model2: 0.1, model3: 0.1, model4: 0.6
Sure, here's the updated code with the new ensembling weights:
pythonCopy codeimport pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Disable parallelism in tokenization to avoid running out of memory
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Define the new TestDataset with the same structure as NewsDataset
class TestDataset(Dataset):
def __init__(self, texts, tokenizer, max_len):
self.texts = texts
self.tokenizer = tokenizer
self.max_len = max_len
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
text = str(self.texts[idx])
inputs = self.tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=self.max_len,
padding='max_length',
truncation=True,
return_attention_mask=True,
)
return {
'input_ids': torch.tensor(inputs['input_ids'], dtype=torch.long),
'attention_mask': torch.tensor(inputs['attention_mask'], dtype=torch.long),
}
# Define the ensembling weights for each model
model_weights = [0.2, 0.1, 0.1, 0.6]
# Load the test data from the CSV file
test_data = pd.read_csv('test.csv')
# Load the best model weights for each model
model_names = [
'best_model/1/',
'best_model/2/',
'best_model/3/',
'best_model/4/',
]
tokenizers = [
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('roberta-base'),
]
models = [
AutoModelForSequenceClassification.from_pretrained(name, num_labels=8, ignore_mismatched_sizes=True).to('cuda')
for name in model_names
]
# Generate the predictions on the test set using a weighted ensemble
predictions = torch.zeros(len(test_data), 8, device='cuda')
for i, model in enumerate(models):
tokenizer = tokenizers[i]
test_dataset = TestDataset(test_data['text'], tokenizer, max_len=512)
test_loader = DataLoader(test_dataset, batch_size=8, num_workers=4)
model.eval()
with torch.no_grad():
for j, batch in enumerate(test_loader):
input_ids = batch['input_ids'].to('cuda')
attention_mask = batch['attention_mask'].to('cuda')
outputs = model(input_ids, attention_mask=attention_mask)
logits = outputs.logits
predictions[j*8:(j+1)*8, :] += model_weights[i] * logits
# Convert the predictions to labels and create a submission CSV file
labels = torch.argmax(predictions, dim=1).cpu().numpy()
submission = pd.DataFrame({'id': test_data['id'], 'label': labels})
submission.to_csv('submission.csv', index=False)
Here, we've updated the model_weights
list to reflect the new ensemble weights. The rest of the code is the same as before.
rewrite the code for following weights: model1: 0.25, model2: 0.2, model3: 0.2, model4: 0.35
Sure, here's the updated code with the new ensembling weights:
pythonCopy codeimport pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Disable parallelism in tokenization to avoid running out of memory
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Define the new TestDataset with the same structure as NewsDataset
class TestDataset(Dataset):
def __init__(self, texts, tokenizer, max_len):
self.texts = texts
self.tokenizer = tokenizer
self.max_len = max_len
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
text = str(self.texts[idx])
inputs = self.tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=self.max_len,
padding='max_length',
truncation=True,
return_attention_mask=True,
)
return {
'input_ids': torch.tensor(inputs['input_ids'], dtype=torch.long),
'attention_mask': torch.tensor(inputs['attention_mask'], dtype=torch.long),
}
# Define the ensembling weights for each model
model_weights = [0.25, 0.2, 0.2, 0.35]
# Load the test data from the CSV file
test_data = pd.read_csv('test.csv')
# Load the best model weights for each model
model_names = [
'best_model/1/',
'best_model/2/',
'best_model/3/',
'best_model/4/',
]
tokenizers = [
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('roberta-base'),
]
models = [
AutoModelForSequenceClassification.from_pretrained(name, num_labels=8, ignore_mismatched_sizes=True).to('cuda')
for name in model_names
]
# Generate the predictions on the test set using a weighted ensemble
predictions = torch.zeros(len(test_data), 8, device='cuda')
for i, model in enumerate(models):
tokenizer = tokenizers[i]
test_dataset = TestDataset(test_data['text'], tokenizer, max_len=512)
test_loader = DataLoader(test_dataset, batch_size=8, num_workers=4)
model.eval()
with torch.no_grad():
for j, batch in enumerate(test_loader):
input_ids = batch['input_ids'].to('cuda')
attention_mask = batch['attention_mask'].to('cuda')
outputs = model(input_ids, attention_mask=attention_mask)
logits = outputs.logits
predictions[j*8:(j+1)*8, :] += model_weights[i] * logits
# Convert the predictions to labels and create a submission CSV file
labels = torch.argmax(predictions, dim=1).cpu().numpy()
submission = pd.DataFrame({'id': test_data['id'], 'label': labels})
submission.to_csv('submission.csv', index=False)
Here, we've updated the model_weights
list to reflect the new ensemble weights. The rest of the code is the same as before.
rewrite the code for following weights: model1: 0.25, model2: 0.25, model3: 0.25, model4: 0.25
Sure, here's the updated code with the new ensembling weights:
pythonCopy codeimport pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Disable parallelism in tokenization to avoid running out of memory
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Define the new TestDataset with the same structure as NewsDataset
class TestDataset(Dataset):
def __init__(self, texts, tokenizer, max_len):
self.texts = texts
self.tokenizer = tokenizer
self.max_len = max_len
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
text = str(self.texts[idx])
inputs = self.tokenizer.encode_plus(
text,
None,
add_special_tokens=True,
max_length=self.max_len,
padding='max_length',
truncation=True,
return_attention_mask=True,
)
return {
'input_ids': torch.tensor(inputs['input_ids'], dtype=torch.long),
'attention_mask': torch.tensor(inputs['attention_mask'], dtype=torch.long),
}
# Define the ensembling weights for each model
model_weights = [0.25, 0.25, 0.25, 0.25]
# Load the test data from the CSV file
test_data = pd.read_csv('test.csv')
# Load the best model weights for each model
model_names = [
'best_model/1/',
'best_model/2/',
'best_model/3/',
'best_model/4/',
]
tokenizers = [
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('abhishek/autonlp-bbc-news-classification-37229289'),
AutoTokenizer.from_pretrained('roberta-base'),
]
models = [
AutoModelForSequenceClassification.from_pretrained(name, num_labels=8, ignore_mismatched_sizes=True).to('cuda')
for name in model_names
]
# Generate the predictions on the test set using a weighted ensemble
predictions = torch.zeros(len(test_data), 8, device='cuda')
for i, model in enumerate(models):
tokenizer = tokenizers[i]
test_dataset = TestDataset(test_data['text'], tokenizer, max_len=512)
test_loader = DataLoader(test_dataset, batch_size=8, num_workers=4)
model.eval()
with torch.no_grad():
for j, batch in enumerate(test_loader):
input_ids = batch['input_ids'].to('cuda')
attention_mask = batch['attention_mask'].to('cuda')
outputs = model(input_ids, attention_mask=attention_mask)
logits = outputs.logits
predictions[j*8:(j+1)*8, :] += model_weights[i] * logits
# Convert the predictions to labels and create a submission CSV file
labels = torch.argmax(predictions, dim=1).cpu().numpy()
submission = pd.DataFrame({'id': test_data['id'], 'label': labels})
submission.to_csv('submission.csv', index=False)
Here, we've updated the model_weights
list to reflect the new ensemble weights. The rest of the code is the same as before.