龙空技术网

My Research Experience at "UIUC"

工匠铺教育


Today we are sharing

Student Z's reflections on doing research at UIUC

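In recent years, machine learning has become very popular in artificial intelligence, and it is also a new tool for improving prediction accuracy. My major is closely related to machine learning, so before coming to the United States I decided to learn some machine learning algorithms and apply them in a medicine-related field, although I did not yet know what my own research would be.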

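Before leaving for the United States, I also thought about the difficulties I might run into. The first was data acquisition and preprocessing. Data is one of the key factors in successful machine learning; Neil Lawrence, a machine learning professor and a member of the Amazon AI team, once said that no matter how good an algorithm is, the best way to drive progress in machine learning is to obtain large amounts of data. The second was improving the algorithm. Both expectations were borne out during a month of research.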

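After the first meeting, I learned that the degree of freedom here is very high. From choosing the topic to forming hypotheses, finding data, and even solving problems, every step was up to me.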

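My mentor's rich background helped me a great deal. The mentor did say I could also choose a topic in finance, which I took as a little joke. Ha~ Because I had no medical background, I needed some time early in the research to catch up on tumors and genes so that I could choose a topic well. In the end I settled on tumor gene identification as the project, with the research done mainly on the MATLAB platform.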

And so, during this short Christmas break, while other students either went home or went traveling, I came to UIUC to keep "studying".

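Once the research content was settled, I analyzed the available data. The gene data has few samples but very high gene dimensionality, and the samples were already labeled. Based on these properties and on a review of the literature, I decided to use an SVM prediction model for training and classification.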

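I then hit the first difficulty: data preprocessing. The quality of the data affects how well the SVM classifies later, so I spent a lot of time on data processing. The processing has three steps:
1. Normalize the data so that all values are on the same scale, which removes differences in scale as far as possible.
2. Remove irrelevant genes and redundant genes, so that the remaining genes are those that are mutated or mutating and are not duplicated. To remove irrelevant genes I used an information-index classification method, which accounts for the effect of variance on the classification result and is based on the common signal-to-noise ratio method. To remove redundant genes I used correlation-coefficient redundancy elimination, which decides whether a gene should be dropped based on its similarity to the other genes. The final classification results show that this feature selection is very effective.
3. Apply the principal component method to the genes. After these three steps only 134 genes were left, which greatly reduced the dimensionality and gave the expected result.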

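After preprocessing, the data set was randomly split into a training set and a test set. The training set was first fed into the SVM model to determine sample types; the accuracy reached 98.8889%, a good result. The model was then used for prediction, and its classification accuracy on the test set was 99.2063%. With that, the project's research concluded with a satisfactory classification result.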

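In fact, before settling on the SVM algorithm, I also tried the lasso algorithm and neural networks.

Like principal component analysis, the lasso algorithm reduces dimensionality, and both work well for feature selection and extraction. The BP neural network is a commonly used prediction algorithm, but even after lasso processing the data was still noisy and high-dimensional, so BP neural network training did not achieve the desired result.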

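In the end, we found that the SVM algorithm suits the characteristics of genetic data, and that doing a large amount of effective data preprocessing before applying the classifier produces better results.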

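The improvements to the SVM approach in this project focused mainly on data processing: feature selection and extraction worked well, and classifier training then also achieved good results.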

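This project still needs further work. First, although the classification results are good, the computation is very time-consuming, especially the gene elimination step, which takes up to an hour; I hope it can be sped up in the future. Second, the application to other data is an open question: a hospital would have a more realistic, complex, and much larger database, and whether this processing method would still achieve good results there is unknown. There is still a lot of work to do on using support vector machines (SVM) for gene expression data analysis.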

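Even at university level, teaching in China and in the United States differs a lot. I had taken part in some projects with teachers in China, but there I mostly followed the teacher's lead step by step.

This project, by contrast, gave me a great deal of freedom. Because there were so many algorithms I wanted to apply, I sometimes lost direction; I overturned my earlier ideas many times while looking for a new and suitable one, which also caused some time-management problems.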

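What's more, when talking with my dear mentor, I sometimes felt that I had not formed my ideas well enough to communicate them. This is something I should change in my future studies.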

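Boston is a very attractive city where science and technology have become a pillar industry, a direction in which many cities now hope to develop. In my spare time I could always meet interesting people and things in the library or on campus, which makes me really look forward to my future studies. Besides what I learned, my spoken English also improved, thanks not only to my mentor but also to my host family.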

Finally, this project was completed with the help of my dear mentor Lei Gu, Licia, Teacher Zhao, and Teacher Liu. Thank you all very much!

The above is the editor's translation.

The original version follows for those who want to read it.

I am honored to have had this opportunity to participate in the summer research program at Harvard Medical School.

Now I will report on this research project.

In recent years, machine learning has been very popular in the field of artificial intelligence, and it is also a new tool for improving prediction accuracy.

My major is also closely related to machine learning. Therefore, before I came to the United States, I had decided to learn some machine learning algorithms as soon as I could and apply them in a medicine-related field, although I did not yet know what my own research would be.

Before leaving, I also thought about the possible difficulties in the project. One was data acquisition and preprocessing: data is one of the key factors for successful machine learning. Neil Lawrence, a machine learning professor and a member of the Amazon AI team, once said that no matter how good an algorithm is, the best way to drive progress in machine learning is to obtain large amounts of data.

The second was the improvement of the algorithm. These two expectations were also borne out in a month of scientific research.

After the first meeting, I understood that the degree of freedom in the project was very high: from the choice of topic to the hypotheses, the data, and even the problem solving, every step was up to me, and my mentor's rich background could help me in every way. The mentor did say I could also choose a topic in the financial sector, which was a little joke.

Ha~ Because I had no medical background, I took some time in the early stage of the research to build up my knowledge of tumors and genes so that I could choose the topic better. Ultimately I decided that the project would be Tumor Gene Identification, with the research done mainly on the MATLAB platform.

After determining the research content, I analyzed the existing data. The gene data has few samples but high gene dimensionality, and the samples had already been labeled.

So with these data, and based on the literature study, I decided to choose the SVM prediction model for training and classification. (I need to explain the data a little more: the data was sent to me by another classmate. There were some problems with the data at the beginning, so I added another data set. All data in the project is from the TCGA database.)

Then I encountered the first difficulty: the preprocessing of the data. The quality of the data affects how well the SVM classifies later, so I spent a lot of time on data processing.

The processing of the data is divided into three steps. First, the data is normalized so that all values are on the same scale, which eliminates differences in scale as far as possible.
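As a rough illustration of the normalization step, the sketch below standardizes each gene (column) to zero mean and unit variance. The essay does not say which normalization was used and shows no MATLAB code, so z-scoring, the name `zscore_columns`, and the toy matrix are all assumptions, written in Python purely for illustration.

```python
import math

def zscore_columns(rows):
    """Standardize each column (gene) to mean 0 and sample std 1.

    rows: list of samples, each a list of expression values.
    Constant columns are mapped to 0.0 to avoid dividing by zero.
    """
    n = len(rows)
    n_genes = len(rows[0])
    out = [[0.0] * n_genes for _ in range(n)]
    for j in range(n_genes):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        std = math.sqrt(var)
        for i in range(n):
            out[i][j] = (col[i] - mean) / std if std > 0 else 0.0
    return out

# Hypothetical toy data: 3 samples x 2 genes on very different scales.
toy = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
normalized = zscore_columns(toy)
```

After this step both toy genes take the same values per sample even though their raw scales differ by a factor of 100, which is the point of putting the data "on the same scale".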

Second, extraneous genes and redundant genes are removed, so that the genes that remain are either mutated or mutating and are not duplicated.

To remove extraneous genes, I chose an information-index classification method, which is a good way to account for the effect of variance on the classification results; it is based on the common signal-to-noise ratio method.
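The essay does not give the formula for its information index, only that it builds on the signal-to-noise ratio. The sketch below therefore shows the plain two-class signal-to-noise score, |mean_1 - mean_0| / (std_1 + std_0), as a stand-in; the function names, the 0/1 labels, and the toy values are hypothetical.

```python
import math

def mean_std(values):
    """Return the mean and sample standard deviation of a list."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / (n - 1)
    return m, math.sqrt(var)

def snr_scores(rows, labels):
    """Score each gene by |mean_1 - mean_0| / (std_1 + std_0).

    A high score means the class means are far apart relative to the
    within-class spread, so the gene is informative for classification.
    """
    scores = []
    for j in range(len(rows[0])):
        g0 = [r[j] for r, y in zip(rows, labels) if y == 0]
        g1 = [r[j] for r, y in zip(rows, labels) if y == 1]
        m0, s0 = mean_std(g0)
        m1, s1 = mean_std(g1)
        denom = s0 + s1
        scores.append(abs(m1 - m0) / denom if denom > 0 else 0.0)
    return scores

def top_genes(scores, k):
    """Indices of the k highest-scoring genes, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]

# Hypothetical toy data: gene 0 separates the classes, gene 1 does not.
rows = [[1.0, 5.0], [2.0, 1.0], [9.0, 4.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
scores = snr_scores(rows, labels)
selected = top_genes(scores, 1)
```

Genes with low scores would be treated as extraneous and dropped; how many to keep is a choice the essay does not specify.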

To remove redundant genes, I chose the correlation-coefficient redundancy elimination method, which decides whether a gene needs to be eliminated based on its similarity to the other genes. The final classification results show that this feature selection is very effective.
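One common way to do correlation-based redundancy elimination is a greedy pass over the ranked genes, sketched below. Whether the author used this exact procedure is not stated; the 0.9 threshold, the function names, and the toy data are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def drop_redundant(rows, ranked_genes, threshold=0.9):
    """Walk genes from best to worst and keep a gene only if its absolute
    correlation with every gene already kept is below the threshold."""
    columns = {j: [r[j] for r in rows] for j in ranked_genes}
    kept = []
    for j in ranked_genes:
        if all(abs(pearson(columns[j], columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept

# Hypothetical toy data: gene 1 is almost a multiple of gene 0.
rows = [[1.0, 2.0, 3.0], [2.0, 4.1, 1.0], [3.0, 6.0, 2.0], [4.0, 8.2, 5.0]]
kept = drop_redundant(rows, ranked_genes=[0, 1, 2], threshold=0.9)
```

The number of pairwise comparisons grows roughly with the square of the gene count, which may be one reason an elimination step like this is slow on high-dimensional expression data.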

Third, I applied the principal component method to the genes. After these three steps, only 134 genes were left, which greatly reduced the dimensionality and gave the expected result.
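For readers unfamiliar with the principal component method, here is a minimal sketch that finds the first principal component by power iteration on the covariance matrix. It is not the author's MATLAB code; the function name, the iteration count, and the toy data are assumptions, and building a full covariance matrix like this is only practical for a small number of features.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component.

    direction: unit vector of largest variance in the centered data.
    scores: projection of each centered sample onto that direction.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration; the all-ones start is assumed not to be orthogonal
    # to the leading eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical toy data lying exactly on the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, scores = first_principal_component(rows)
```

Note that principal components are weighted combinations of the input features rather than a subset of the original genes.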

After the data preprocessing, the data set was randomly divided into a training set and a test set. The training set was first put into the SVM model to determine sample type; the accuracy was as high as 98.8889%, which is a good result. The model could therefore go on to forecasting, and its classification accuracy on the test set was 99.2063%.
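The split-train-evaluate workflow can be sketched as follows. The essay does not give the split ratio, the SVM kernel, or any hyperparameters, so the 70/30 split, the linear SVM trained by batch subgradient descent on the hinge loss, and every name and value below are assumptions. The toy data does not reproduce the reported accuracies.

```python
import random

def train_test_split(rows, labels, test_fraction=0.3, seed=0):
    """Shuffle indices with a fixed seed and split into train and test."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

def train_linear_svm(rows, labels, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularized hinge loss.

    labels must be -1 or +1. Returns the weight vector and the bias.
    """
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for x, y in zip(rows, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # only margin violators contribute
                for j in range(d):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def accuracy(w, b, rows, labels):
    """Percentage of samples classified correctly."""
    correct = sum(predict(w, b, x) == y for x, y in zip(rows, labels))
    return 100.0 * correct / len(rows)

# Hypothetical, clearly separable toy data: two clusters of five samples.
rows = [[-2.0 - 0.1 * i, -2.0 + 0.1 * i] for i in range(5)]
rows += [[2.0 + 0.1 * i, 2.0 - 0.1 * i] for i in range(5)]
labels = [-1] * 5 + [1] * 5

(train_x, train_y), (test_x, test_y) = train_test_split(rows, labels)
w, b = train_linear_svm(train_x, train_y)
train_acc = accuracy(w, b, train_x, train_y)
test_acc = accuracy(w, b, test_x, test_y)
```

Reporting accuracy on the held-out test set separately from the training set, as the essay does, is what shows whether the model generalizes.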

With this, the study came to an end, and the classification result was the ideal one.

Actually, before deciding to use the SVM algorithm, I also tried the lasso algorithm and neural networks. The lasso algorithm and principal component analysis have the same effect, dimension reduction, and both work well for feature selection and extraction.

The BP neural network is a prediction algorithm that is often used, but even after lasso processing the data was still noisy and high-dimensional, so BP neural network training did not achieve the ideal effect.
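Lasso selects features because its L1 penalty pushes small coefficients exactly to zero. A minimal coordinate-descent sketch of that mechanism follows, minimizing (1/(2n))·||y − Xw||² + λ·||w||₁ with no intercept. The function names, the value of λ, and the toy data are assumptions; the author presumably used MATLAB's built-in tooling rather than anything like this.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Cyclic coordinate descent for the lasso objective above."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction using every feature except j.
                partial = sum(w[k] * X[i][k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return w

# Hypothetical toy data with orthogonal features: y = 3*x0 + 0.1*x1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [3.1, -2.9, 2.9, -3.1]
weights = lasso_coordinate_descent(X, y, lam=0.5)
```

On this toy problem the weak feature's coefficient is set exactly to zero while the strong one is shrunk from 3.0 to about 2.5, which is the selection-plus-shrinkage behaviour that makes lasso useful for high-dimensional data.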

Finally, the SVM algorithm was found to suit the characteristics of genetic data, and doing a large amount of effective data preprocessing before applying the classifier leads to better results.

The improvement of the SVM approach in this project concentrated mainly on data processing: feature selection and extraction achieved a good effect, and classifier training then also achieved good results.

This research project also needs further study. First, although the classification result is good, the computation is very time-consuming, especially the gene elimination step, which takes up to an hour; I hope it can be sped up in the future.

Second, the application to other data is an open question. If the model were applied in hospitals, which have a more realistic, complex, and much larger database, it is unclear whether this processing method could still achieve the ideal result. Support vector machine (SVM) research on gene expression data analysis still has a lot of work to do in the future.

China and the United States differ a lot in teaching, even at university.

Although I had participated in some projects with my teachers in China, what I mostly did there was follow the teacher's lead step by step.

This project, however, offered a very high degree of freedom. Because there were so many algorithms I wanted to apply, I sometimes could not find a direction, and I overturned my earlier ideas many times, always looking for a new and suitable one. This also led to a few problems with time management.

What's more, when communicating with my dear mentor, I sometimes felt that I had not formed my ideas well enough to communicate them. This should change in my future studies.

Boston is a very attractive city where science and technology have become a pillar industry, a direction in which many cities now want to develop.

During my leisure time, I could always meet some interesting people and things in the library or on campus, which makes me really look forward to my future studies.

In addition to the knowledge gained, my oral English has also improved, thanks not only to my mentor but also to my host family. Finally, the project was completed with the help of my dear mentor Lei Gu, Licia, Teacher Zhao, and Teacher Liu.

Thank you all very much!

That's all for today's share.

See you tomorrow.

Tags: #lasso matlab