

工匠铺教育

























I am honored to have had this opportunity to participate in the summer research program at Harvard Medical School.

Here is my report on the research project.

In recent years, machine learning has been very popular in the field of artificial intelligence, and it has also become a new tool for improving prediction accuracy.

My major is closely related to machine learning. So before I came to the United States, I had already decided to learn some machine learning algorithms as quickly as I could and to apply them in a medical field, although I did not yet know what my own research topic would be.

Before leaving, I also thought about the possible difficulties in the project. The first was data acquisition and preprocessing, since data is one of the key factors in successful machine learning. Neil Lawrence, a machine learning professor and a member of Amazon's AI team, once said that no matter how good an algorithm is, the best way to drive progress in machine learning is to obtain large amounts of data.

The second was improving the algorithm. Both expectations were borne out during the month of research.

After the first meeting, I understood that the project offered a very high degree of freedom. From choosing the topic and forming hypotheses to finding data and solving the problem, every step was up to me, with my mentor's rich background there to help at each stage. My mentor even said I could choose a topic from the financial sector, which I took as a small joke.

Ha~ Because I had no medical background, I spent some time early in the research building up my knowledge of tumors and genes so that I could choose a topic well. In the end I settled on Tumor Gene Identification as the project, with MATLAB as the main research platform.

After settling on the research content, I analyzed the available data. The gene data has few samples but very high dimensionality, and the samples were already labeled.

Given these data and my reading of the literature, I chose an SVM model for training and classification. (A note on the data: it was sent to me by a classmate. There were some problems with it at the beginning, so I added a second data set. All data in the project comes from the TCGA database.)

Then I encountered the first difficulty: data preprocessing. The quality of the data affects how well the SVM classifies later on, so I spent a lot of time on data processing.

The data processing has three steps. First, the data is normalized so that all values are on the same scale, which removes scale differences in the data as far as possible.
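The normalization step can be sketched as follows. This is only an illustration in Python, not the MATLAB code used in the project, and it assumes z-score standardization per gene; the report does not say which normalization was applied. The function name and toy values are hypothetical.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each column (gene) to zero mean and unit variance.

    Assumption: z-score scaling; the report only says "normalized".
    """
    stats = []
    for column in zip(*rows):
        mu = mean(column)
        sigma = pstdev(column)
        # A constant gene has sigma == 0; use 1.0 so it maps to 0.0
        # instead of dividing by zero.
        stats.append((mu, sigma if sigma > 0 else 1.0))
    return [
        [(value - mu) / sigma for value, (mu, sigma) in zip(row, stats)]
        for row in rows
    ]

# Hypothetical toy data: 3 samples x 2 genes on very different scales.
samples = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
normalized = zscore_columns(samples)
```

After scaling, both genes in the toy data take the same standardized values even though their raw ranges differ by a factor of 100, which is the "same level" effect described above.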

Second, extraneous genes and redundant genes are removed, so that the genes that remain are either mutated or mutating and are not duplicated.

To remove extraneous genes, I chose a classification information index method. It builds on the common signal-to-noise ratio method and is a good way to take the effect of variance on the classification results into account.
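As a rough illustration of the signal-to-noise idea, the sketch below scores each gene by the plain signal-to-noise ratio, |mean of class 1 minus mean of class 0| divided by the sum of the two class standard deviations. This is the common baseline, not the exact information index used in the project, whose formula the report does not give. The names `snr_scores`, `top_genes` and the toy data are hypothetical.

```python
from statistics import mean, pstdev

def snr_scores(rows, labels):
    """Per-gene score: |mean_pos - mean_neg| / (std_pos + std_neg)."""
    positives = [row for row, label in zip(rows, labels) if label == 1]
    negatives = [row for row, label in zip(rows, labels) if label == 0]
    scores = []
    for pos_values, neg_values in zip(zip(*positives), zip(*negatives)):
        gap = abs(mean(pos_values) - mean(neg_values))
        spread = pstdev(pos_values) + pstdev(neg_values)
        if spread > 0:
            scores.append(gap / spread)
        else:
            # No within-class variation: perfectly separating or useless.
            scores.append(float("inf") if gap > 0 else 0.0)
    return scores

def top_genes(scores, k):
    """Return the indices of the k highest-scoring genes, in index order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Hypothetical toy data: gene 0 separates the classes, gene 1 does not.
expression = [[5.0, 1.0], [6.0, 2.0], [1.0, 1.0], [2.0, 2.0]]
tumor_labels = [1, 1, 0, 0]
scores = snr_scores(expression, tumor_labels)
selected = top_genes(scores, k=1)
```

Genes with a low score differ little between the two classes relative to their spread, so they are candidates for removal as extraneous.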

To remove redundant genes, I chose a correlation-coefficient redundancy elimination method, which decides whether a gene should be eliminated based on its similarity to the other genes. The final classification results show that this feature selection is very effective.
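One simple way to realize correlation-based redundancy elimination is a greedy pass: keep a gene only if its absolute Pearson correlation with every gene already kept stays below a threshold. The sketch below assumes this greedy scheme and a threshold of 0.9; neither detail comes from the report, and the function names and data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def drop_redundant(rows, threshold=0.9):
    """Return indices of genes kept after greedy redundancy elimination."""
    columns = [list(column) for column in zip(*rows)]
    kept = []
    for index, column in enumerate(columns):
        # Keep the gene only if it is not too similar to any kept gene.
        if all(abs(pearson(column, columns[k])) < threshold for k in kept):
            kept.append(index)
    return kept

# Hypothetical toy data: gene 1 is exactly twice gene 0, gene 2 is unrelated.
expression = [[1.0, 2.0, 5.0], [2.0, 4.0, 1.0], [3.0, 6.0, 4.0], [4.0, 8.0, 2.0]]
kept_genes = drop_redundant(expression)
```

Because every candidate is compared against all kept genes, the cost grows roughly with the square of the number of genes, which is one plausible reason an elimination step like this can be slow on high-dimensional data.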

Third, I applied principal component analysis to the remaining genes. After these three steps only 134 genes were left, a large reduction in dimensionality, and the result was as expected.
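For readers unfamiliar with principal component analysis, here is a minimal sketch that finds the first principal component by power iteration on the sample covariance matrix. It is a teaching example under stated assumptions, not the project's procedure: a real analysis would use a library routine, and building the full covariance matrix is impractical for thousands of genes. The function name and data are hypothetical.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix (p x p).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    # Power iteration; assumes the start vector is not orthogonal
    # to the leading eigenvector.
    direction = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(value * value for value in product))
        if norm == 0:
            break
        direction = [value / norm for value in product]
    # Project each centered sample onto the component.
    scores = [sum(c[j] * direction[j] for j in range(p)) for c in centered]
    return direction, scores

# Hypothetical toy data: all points lie on the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, scores = first_principal_component(points)
```

In the toy data the two features carry the same information, so a single component along the direction (1, 2) captures all of the variance and the two columns can be replaced by one score per sample.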

After preprocessing, the data set was randomly divided into a training set and a test set. The training set was first fed into the SVM model to determine the sample types; the accuracy reached 98.8889%, which is a good result. The model could therefore go on to prediction, and its classification accuracy on the test set was 99.2063%.
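The split-and-score procedure can be sketched like this. The 30% test fraction and the fixed seed are assumptions for the example; the report gives neither the split ratio nor the sample counts. Function names are hypothetical.

```python
import random

def train_test_split(rows, labels, test_fraction=0.3, seed=0):
    """Randomly split (row, label) pairs into a training and a test list."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(rows[i], labels[i]) for i in train_idx]
    test = [(rows[i], labels[i]) for i in test_idx]
    return train, test

def accuracy_percent(predicted, actual):
    """Percentage of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)

# Hypothetical toy data: 10 one-feature samples with alternating labels.
rows = [[float(i)] for i in range(10)]
labels = [i % 2 for i in range(10)]
train, test = train_test_split(rows, labels)
```

Purely as arithmetic, 89 correct out of 90 gives 98.8889% and 125 correct out of 126 gives 99.2063%, so figures like the ones reported are what one misclassified sample would produce on sets of those sizes. The actual set sizes are not stated in the report.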

This concluded the study, and the classification performance was what I had hoped for.

Actually, before settling on the SVM algorithm, I also tried the lasso algorithm and a neural network. Lasso and principal component analysis serve the same purpose here, dimension reduction, and both work well for feature selection and extraction.
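To show why lasso acts as a feature selector, the sketch below solves the lasso objective (1/(2n))·||y − Xw||² + λ·||w||₁ by coordinate descent with soft-thresholding. This is a generic textbook formulation written in Python for illustration; it is not the MATLAB routine used in the project, and the function names, penalty value and data are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize (1/(2n))*||y - Xw||^2 + lam*||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            scale = sum(row[j] ** 2 for row in X) / n
            if scale == 0:
                continue  # an all-zero feature keeps weight 0
            rho = 0.0
            for row, target in zip(X, y):
                # Prediction using every feature except j.
                partial = sum(w[k] * row[k] for k in range(p) if k != j)
                rho += row[j] * (target - partial)
            w[j] = soft_threshold(rho / n, lam) / scale
    return w

# Hypothetical toy data: y depends only on feature 0 (y = 2 * x0).
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]
weights = lasso_coordinate_descent(X, y, lam=0.5)
```

On the toy data the irrelevant feature gets a weight of exactly zero, while the relevant one is shrunk from 2 to 1.5 by the penalty. Dropping the zero-weight features is what reduces the dimension.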

The BP neural network is a commonly used prediction algorithm, but data processed only with lasso was still noisy and high-dimensional, so training the BP neural network on it did not achieve the ideal result.

In the end, I found that the SVM algorithm suits the characteristics of genetic data, and that doing a large amount of effective data preprocessing before applying the classifier leads to better results.
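For a concrete picture of the classifier itself, here is a minimal linear SVM trained by full-batch subgradient descent on the regularized hinge loss. It is a sketch under assumptions: the report does not state the kernel, the parameters or the MATLAB function used, so the linear kernel, the hyperparameter values and all names here are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, learning_rate=0.1, epochs=200):
    """Minimize lam/2*||w||^2 + mean(max(0, 1 - y*(w.x + b))); y in {-1, +1}."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for row, label in zip(X, y):
            margin = label * (sum(wj * xj for wj, xj in zip(w, row)) + b)
            if margin < 1:
                # Sample violates the margin: add the hinge subgradient.
                for j in range(p):
                    grad_w[j] -= label * row[j] / n
                grad_b -= label / n
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b

def predict(w, b, row):
    """Classify one sample as +1 or -1 by the sign of the decision value."""
    score = sum(wj * xj for wj, xj in zip(w, row)) + b
    return 1 if score >= 0 else -1

# Hypothetical toy data: two linearly separable classes.
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, row) for row in X]
```

With few samples and many features, as in gene expression data, a margin-based classifier like this depends heavily on which features survive preprocessing, which fits the observation above that preprocessing drove the results.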

The main improvement to the SVM approach in this project lies in the data processing: feature selection and extraction worked well, and the classifier training then achieved good results too.

The project still needs further study. First, although the classification results are good, the computation is very time-consuming, especially the gene elimination step, which takes up to an hour. I hope to speed this up in the future.

Second, the data used here is open data. If the model were applied in hospitals, with larger and more complex real-world databases, it is unclear whether this processing method would still achieve ideal results. So there is still a lot of work to do on support vector machine (SVM) analysis of gene expression data.

China and the United States differ a great deal in teaching, even at university level.

Although I had taken part in some projects with my teacher in China, there I mostly followed the teacher's lead step by step;

but this project offered a very high degree of freedom. There were so many algorithms I wanted to apply that I sometimes could not find a direction, and I overturned my earlier ideas many times, always looking for a new and more suitable approach. This also led to some problems with time management.

What's more, when communicating with my dear mentor, I sometimes felt that I did not have fully formed ideas to discuss. This is something I should change in my future studies.

Boston is a very attractive city where science and technology have become a pillar industry, which is the direction many cities now want to develop in.

During my leisure time, I could always meet interesting people and things in the library or on campus, which makes me really look forward to my future studies.

In addition to the knowledge I gained, my spoken English also improved, thanks not only to my mentor but also to my host family. Finally, the project was completed with the help of my dear mentor Lei Gu, Licia, teacher Zhao and teacher Liu.

Thank you all very much!


