Last modified: 1st November 2017
Prepare a merged dataset
I wanted to obtain many samples, especially European ones, so that I could compare any individual ancient sample in detail.
This is what I did; try it if you want, and add or remove public datasets as necessary.
To try the most common ancient and modern samples from the Reich lab, I downloaded the following packages.
You probably want to follow these steps:
wget https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/NearEastPublic.tar.gz https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/ScythianSarmatian.tar.gz https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/MinMyc.tar.gz
tar -zxvf NearEastPublic.tar.gz
tar -zxvf ScythianSarmatian.tar.gz
tar -zxvf MinMyc.tar.gz
Each of these packages contains three files: xxx.geno / xxx.ind / xxx.snp , where xxx is the name of the dataset. From the NearEastPublic package I only selected the HumanOriginsPublic2068 files.
Place all extracted files in a single folder (I called it BED).
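A minimal sketch of that gathering step, assuming a Unix-like shell and that the extracted files carry the prefixes HumanOriginsPublic2068, ScythianSarmatian and MinMyc (the last two are assumptions based on the package names, so check what tar actually printed):

```shell
mkdir -p BED
for name in HumanOriginsPublic2068 ScythianSarmatian MinMyc; do
  for ext in geno ind snp; do
    # Look for the file anywhere below the current folder, skipping BED itself.
    found=$(find . -path ./BED -prune -o -type f -name "$name.$ext" -print | head -n 1)
    if [ -n "$found" ]; then
      mv "$found" BED/
    else
      echo "not found (or already in BED): $name.$ext"
    fi
  done
done
```

Any "not found" line tells you which prefix to correct before going on.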
I wanted to merge the files and remove some samples that are not useful for further analyses.
Now you have to decide whether you just want to merge the data, in which case I recommend working directly with binaries, which is simpler, or whether you also want to add some BED or PED files, in which case you need to work with those files.
Work directly with binaries
You can merge datasets directly with Eigensoft.
For example, to merge the Minoans and Mycenaeans dataset with that of Scythians and Sarmatians, you will need a file like this (I named it mergeMinSS):
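The parameter file itself is not reproduced here. As a rough sketch only, the snippet below writes a file with the parameter names that the EIGENSOFT documentation lists for mergeit. The input names assume the extracted files are called MinMyc.* and ScythianSarmatian.*, and the output prefix MinSS is a hypothetical name, so adapt all of them to your own files.

```shell
# Write the mergeit parameter file (all file names are assumptions, see above).
cat > mergeMinSS <<'EOF'
geno1: MinMyc.geno
snp1: MinMyc.snp
ind1: MinMyc.ind
geno2: ScythianSarmatian.geno
snp2: ScythianSarmatian.snp
ind2: ScythianSarmatian.ind
genooutfilename: MinSS.geno
snpoutfilename: MinSS.snp
indoutfilename: MinSS.ind
EOF
```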
mergeit -p mergeMinSS
If you are interested, for example, in comparing it with modern populations, you can use a file like this (I named it mergeMinSSHO):
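This parameter file is not reproduced either. Since mergeit parameter files differ only in their file names, a small shell function can generate them. This is a sketch under assumptions: write_mergeit_par is a hypothetical helper, MinSS stands for whatever output prefix your first merge used, and MinSSHO is an invented output prefix.

```shell
# write_mergeit_par PAR_FILE PREFIX1 PREFIX2 OUT_PREFIX  (hypothetical helper)
# Writes a mergeit parameter file that merges PREFIX1.* and PREFIX2.* into OUT_PREFIX.*
write_mergeit_par() {
  cat > "$1" <<EOF
geno1: $2.geno
snp1: $2.snp
ind1: $2.ind
geno2: $3.geno
snp2: $3.snp
ind2: $3.ind
genooutfilename: $4.geno
snpoutfilename: $4.snp
indoutfilename: $4.ind
EOF
}

# Previous merge output (assumed prefix MinSS) plus the modern Human Origins samples.
write_mergeit_par mergeMinSSHO MinSS HumanOriginsPublic2068 MinSSHO
```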
Now merge them
mergeit -p mergeMinSSHO
NOTES: When merging some files, you might need to add the following line at the end of the file
because there are duplicates, although this is likely to cause problems eventually in the next iterations (or later when analyzing data). You can eliminate them later – once merged – using PLINK. Also, please notice the names and order of the output files. If you use the current standard it will give errors:
Work with PLINK bed files
This is possibly the best option if you prefer to work directly with files instead of binaries, and probably your only option if you want to add or remove samples and make certain changes.
Write convertf files for each dataset, including these settings. The following example is for Mycenaean data, and I named it convertfMinMyc:
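The settings themselves are missing here. As a sketch, a convertf parameter file that produces PLINK binary files uses outputformat: PACKEDPED together with .bed/.bim/.fam output names. The input names below assume the extracted files are called MinMyc.geno, MinMyc.snp and MinMyc.ind.

```shell
# Write the convertf parameter file for the Minoan/Mycenaean data.
# Input file names are assumptions; the output names match those used later with PLINK.
cat > convertfMinMyc <<'EOF'
genotypename: MinMyc.geno
snpname: MinMyc.snp
indivname: MinMyc.ind
outputformat: PACKEDPED
genotypeoutname: MinMyc.bed
snpoutname: MinMyc.bim
indivoutname: MinMyc.fam
EOF
```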
Write similar files for all datasets (or data) that you want to use.
convertf -p convertfMinMyc
You can write all conversion jobs into a file (say, convertfBED.slurm) to run with Slurm.
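A possible shape for such a job file, as a sketch only: the #SBATCH values are placeholders to adjust for your cluster, and the parameter file names other than convertfMinMyc are hypothetical.

```shell
# Write a Slurm batch script that runs one convertf call per parameter file.
cat > convertfBED.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=convertfBED
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Parameter files other than convertfMinMyc are hypothetical names.
for par in convertfMinMyc convertfScythianSarmatian convertfHumanOriginsPublic2068; do
  convertf -p "$par"
done
EOF

# Submit it with: sbatch convertfBED.slurm
```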
Now you have .bed, .bim and .fam files for your datasets, and you have to use PLINK to merge them.
If you are using Windows like me, you might want to copy or move them to your Windows machine (for example, using your shared folder).
Write a text file with the following content – including only the “secondary datasets”, not the first one, that you will select as your “main dataset” (MinMyc in this case):
HumanOriginsPublic2068.bed HumanOriginsPublic2068.bim HumanOriginsPublic2068.fam
ScythianSarmatian.bed ScythianSarmatian.bim ScythianSarmatian.fam
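If you are still on a Unix-like shell rather than the Windows command prompt, the list can be written and checked in one go. This is a sketch; the check only reports files that are not in the current folder and changes nothing.

```shell
# Write the merge list: one line per secondary dataset (bed, bim, fam).
cat > all_my_files.txt <<'EOF'
HumanOriginsPublic2068.bed HumanOriginsPublic2068.bim HumanOriginsPublic2068.fam
ScythianSarmatian.bed ScythianSarmatian.bim ScythianSarmatian.fam
EOF

# Report any listed file, or any file of the main dataset, that is missing here.
for f in $(cat all_my_files.txt) MinMyc.bed MinMyc.bim MinMyc.fam; do
  [ -f "$f" ] || echo "missing: $f"
done
```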
Then do the following in a command prompt, with all files (and MinMyc) in the same folder:
plink1.9 --bfile MinMyc --merge-list all_my_files.txt --indiv-sort 0 --make-bed --out MyMerged
NOTE. The flag --indiv-sort 0 is essential if you want to work with labels easily after working with the datasets – and you certainly want to do that.
Depending on your datasets, you will probably need to add --allow-no-sex so that ambiguous samples are left for analysis. Adding HumanOriginsPublic2068 certainly needs that flag, i.e.
plink1.9 --bfile MinMyc --merge-list all_my_files.txt --allow-no-sex --indiv-sort 0 --make-bed --out MyMerged
NOTE. If using an older version (i.e. PLINK), an “Out of memory” message is likely to pop up, and you probably need to merge datasets one by one, and maybe even split cer
plink --noweb --bfile MinMyc --merge-list all_my_files.txt --make-bed --out MyMerged
plink --noweb --bfile MyMerged --merge-list all_my_files.txt --indiv-sort 0 --make-bed --out MyMerged2
(Remember that in Windows, for this example to work, plink.exe has to be in the same folder, or you have to call it by its path in a different directory. If you are using plink2.exe, you need to call plink2.)
At the end of the process, you will have received some error messages regarding ambiguous samples (not clearly male or female), which are excluded. You can follow instructions to include them, but for the sake of this example, let's leave them excluded.
To exclude certain samples (in my case, I removed Chimp and hg19ref), you need a file following the .fam guidelines:
Chimp.REF M Chimp
Href.REF M hg19ref
I named it MyMergedRemove. Then I did:
plink1.9 --bfile MyMerged --remove MyMergedRemove --make-bed --out MyMerged2
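PLINK's --remove matches samples on the family ID and individual ID in the first two columns of the list, so one way to be sure the IDs match is to copy them straight from the merged .fam file. The function below is a sketch (make_remove_list is a hypothetical helper name); it prints the first two columns of every .fam line whose individual ID is one of the IDs given.

```shell
# make_remove_list FAM_FILE ID...  (hypothetical helper)
# Prints "FID IID" for each sample in FAM_FILE whose individual ID (column 2) is listed.
make_remove_list() {
  fam=$1
  shift
  for id in "$@"; do
    awk -v id="$id" '$2 == id { print $1, $2 }' "$fam"
  done
}

# Usage: make_remove_list MyMerged.fam Chimp.REF Href.REF > MyMergedRemove
```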
Convert from BED to PED
To convert from BED to PED:
plink1.9 --bfile MyMerged --recode --allow-no-sex --out MyMergedPED
PED files tend to grow quite large with merge operations, so you can clean them using minor allele frequency:
plink1.9 --bfile MyMerged --maf 0.05 --recode --out MyMergedPED
You can also clean with HWE and remove nearly empty individuals. Note that --hwe requires a p-value threshold (0.001 here is just an example value):
plink1.9 --file MyMergedPED --hwe 0.001 --mind 0.9 --recode --out MyMergedPED_clean
This will remove individuals with more than 90% missing genotypes.
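To see what --mind 0.9 is measuring, the missing rate per individual can be computed by hand from a PED file. The function below is only an illustration (ped_missingness is a hypothetical helper, and it assumes the default PED layout of six leading columns followed by two alleles per SNP, with 0 as the missing allele code). For real datasets PLINK's own --missing report is the practical choice.

```shell
# ped_missingness PED_FILE  (hypothetical helper)
# Prints "FID IID rate", where rate is the fraction of SNPs with a missing genotype.
ped_missingness() {
  awk '{
    miss = 0
    n = (NF - 6) / 2                      # number of SNPs on this line
    for (i = 7; i <= NF; i += 2)          # each SNP takes two allele columns
      if ($i == "0" || $(i + 1) == "0") miss++
    printf "%s %s %.3f\n", $1, $2, (n > 0 ? miss / n : 0)
  }' "$1"
}

# Usage: ped_missingness MyMergedPED.ped | awk '$3 > 0.9'
```

Individuals listed by the usage line are the ones that --mind 0.9 would drop.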